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Abstract Iteratively Re-weighted Least Squares (IRLS) is a method for solving minimization problems 
involving non-quadratic cost functions, perhaps non-convex and non-smooth, which however can be de¬ 
scribed as the infimum over a family of quadratic functions. This transformation suggests an algorithmic 
scheme that solves a sequence of quadratic problems to be tackled efficiently by tools of numerical linear 
algebra. Its general scope and its usually simple implementation, transforming the initial non-convex 
and non-smooth minimization problem into a more familiar and easily solvable quadratic optimization 
problem, make it a versatile algorithm. However, despite its simplicity, versatility, and elegant analysis, 
the complexity of IRLS strongly depends on the way the solution of the successive quadratic optimiza¬ 
tions is addressed. For the important special case of compressed sensing and sparse recovery problems 
in signal processing, we investigate theoretically and numerically how accurately one needs to solve the 
quadratic problems by means of the conjugate gradient (CG) method in each iteration in order to guar¬ 
antee convergence. The use of the CG method may significantly speed-up the numerical solution of the 
quadratic subproblems, in particular, when fast matrix-vector multiplication (exploiting for instance the 
FFT) is available for the matrix involved. In addition, we study convergence rates. Our modified IRLS 
method outperforms state of the art first order methods such as Iterative Hard Thresholding (IHT) or 
Fast Iterative Soft-Thresholding Algorithm (FISTA) in many situations, especially in large dimensions. 
Moreover, IRLS is often able to recover sparse vectors from fewer measurements than required for IHT 
and FISTA. 
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compressed sensing • sparse recovery. 


M. Fornasier 

Technische Universitat Miinchen, Fakultat fiir Mathematik, Boltzmannstrasse 3, D-85748, Garching bei Miinchen, Germany 
E-mail: massimo.fornasier@ma.tum.de 

S. Peter 

Technische Universitat Miinchen, Fakultat fiir Mathematik, Boltzmannstrasse 3, D-85748, Garching bei Miinchen, Germany 

Tel.: -h49-89-289-17482 

E-mail: steffen.peter@ma.tum.de 

H. Rauhut 

RWTH Aachen University, Lehrstuhl G fiir Mathematik (Analysis), Pontdriesch 10, D-52062, Aachen, Germany 
E-mail: rauhut@mathc.rwth-aachen.de 

S. Worm 

Schlofistr. 34, D-53115 Bonn, Germany 
E-mail: stephanworm@gmx.de 




1 Introduction 


1.1 Iteratively Re-weighted Least Squares 

Iteratively Re-weighted Least Squares (IRLS) is a method for solving minimization problems by trans¬ 
forming them into a sequence of easier quadratic problems which are then solved with efficient tools of 
numerical linear algebra. Contrary to classical Newton methods smoothness of the objective function is 
not required in general. We refer to the recent paper |3S] for an updated and rather general view about 
these methods. 

In the context of constructive approximation, an IRLS algorithm appeared for the first time in the 
doctoral thesis of Lawson in 1961 |32| in the form of an algorithm for solving uniform approximation 
problems. It computes a sequence of polynomials that minimize a sequence of weighted L^-norms. This 
iterative algorithm is now well-known in classical approximation theory as Lawson’s algorithm. In m 
it is proved that this algorithm essentially obeys a linear convergence rate. 

In the 1970s extensions of Lawson’s algorithm for .^T--norm minimization, and in particular .^i-norm 
minimization, were proposed. Since then IRLS has become a rather popular method also in mathematical 
statistics for robust linear regression [25]. Perhaps the most comprehensive mathematical analysis of the 
performance of IRLS for .^^-norm minimization was given in the work of Osborne [36] . 

The increased popularity of total variation minimization in image processing starting with the pio¬ 
neering work mi, significantly revitalized the interest in these algorithms, because of their simple and 
intuitive implementation, contrary to more general optimization algorithms such as interior point meth¬ 
ods. In particular, in PHD an IRLS for total variation minimization has been proposed. At the same 
time, IRLS appeared as well under the name of Kacanov method in |23] as a fixed point iteration for the 
solution of certain quasi-linear elliptic partial differential equations. In signal processing, IRLS was used 
as a technique to build algorithms for sparse signal reconstruction in |21j . After the pioneering work m 
and the starting of the development of compressed sensing with the seminal papers ISHHI, several works 
[munuiadi addressed systematically the analysis of IRLS for .^^-norm minimization in the form 

min \\x\\e^, (I) 

<Px—y 

where 0 < t ^ I, G ^mxN jg given matrix, and y G C™ a given measurement vector. In these papers, 
the asymptotic super-linear convergence of IRLS towards f,--norm minimization for r < 1 has been 
shown. As an extension of the analysis of the aforementioned papers, IRLS have been also generalized 
towards low-rank matrix recovery from minimal linear measurements [la¬ 
in recent years, there has been an explosion of papers on applications and variations on the theme of 
IRLS, especially in the engineering community of signal processing, and it is by now almost impossible to 
give a complete account of the developments. (Presently Scholar Google reports more than 3180 papers 
since 2010 containing the phrase “Iteratively Re-weighted Least Squares” and more than 100 with it in 
the title since 1970, half of which appeared after 2003.) 


1.2 Contribution of this paper 

Since it is based on a relatively simple reformulation of the initial potentially non-convex and non-smooth 
minimization problem (for instance of the type 0 ) into a more familiar and easily solvable quadratic 
optimization, IRLS is one of the most immediate and intuitive approaches towards such non-standard 
optimizations and perhaps one of the first and popular algorithms beginner practitioners consider for 
their first experiments. However, despite its simplicity, versatility, and elegant analysis, IRLS does not 
outperform in general well-established first order methods, which have been proposed recently for similar 
problems, such as Iterative Hard Thresholding (IHT) or Fast Iterative Soft-Thresholding Algorithm 
(FISTA) P, as we also show in our numerical experiments in Section In fact, its complexity very 
strongly depends on the way the solution of the successive quadratic optimizations is addressed, whether 
one uses preconditioned iterative methods and exploits fast matrix-vector multiplications or just considers 
simple direct linear solvers. If the dimensions of the problem are not too large or the involved matrices 
have no special structure allowing for fast matrix-vector multiplications, then the use of a direct method 
such as Gaussian elimination can be appropriate. When instead the dimension of the problem is large 
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and one can take advantage of the structure of the matrix to perform fast matrix-vector multiplications 
(e.g., for partial Fourier or partial circulant matrices), then it is appropriate to use iterative solvers such 
as the Conjugate Gradient method (CG). The use of CG in the implementation of IRLS is appearing, 
for instance, in |42j towards total variation minimization and in towards £i-norm minimization. 

However, the price to pay is that such solvers will return only an approximate solution whose precision 
depends on the number of iterations. A proper analysis of the convergence of the perturbed method 
in this case has not been reported in the literature. Without such an analysis it is impossible to give 
any estimate of the actual complexity of IRLS. Thus, the scope of this work is to clarify, specifically 
for compressed sensing problems (i.e., for matrices <1> with certain spectral properties such as the Null 
Space Property), how accurately one needs to solve the quadratic problems by means of CG in order to 
guarantee convergence and possibly also asymptotic (super-)linear convergence rates. 

Besides analyzing the effect of GG in an IRLS for problems of the type Q, we further extend it in 
Section to a class of problems of the type 

mm\\<Px-yf^^+2a\\x\\e^, ( 2 ) 

for 0 < r ^ 1, used for sparse recovery in signal processing. In the work nmcciii] a convergence analysis 
of IRLS towards the solution of (|^ has been carried out with two limitations: 

(i) In |31j the authors do not consider the use of an iterative algorithm to solve the appearing system of 
linear equations and they do not show the behavior of the algorithm when the measurements y are 
given with additional noise; 

(ii) Also in [431l44j a precise analysis of convergence is missing when iterative methods are used to solve 
the intermediate sequence of systems of linear equations. Also the non-convex case of t < I is not 
specifically addressed. 

Regarding these gaps, we contribute in this work by 

— giving a proper analysis of the convergence when inaccurate CG solutions are used; 

— extending the results of convergence in [43ll^ to the case of 0 < r < 1 by combining our analysis 
with findings in [551l45j : 

— performing numerical tests which evaluate possible speedups via the GG method, also taking problems 
into consideration where measurements may be affected by noise. 

Our work on CG accelerated IRLS for (§ does not analytically address rates of convergence because this 
turned out to be a very technical task. 


We illustrate the theoretical results of this paper described above by several numerical experiments. 
We first show that our versions of IRLS yield significant improvements in terms of computational time 
and may outperform state of the art first order methods such as Iterative Hard Thresholding (IHT) [3] 
and Fast Iterative Soft-Thresholding Algorithm (FISTA) [1], especially in high dimensional problems 
{N ^ 10®). These results are somehow both surprising and counterintuitive as it is well-known that 
first order methods should be preferred in higher dimension. However, they can be easily explained by 
observing that in certain regimes preconditioning in the conjugate gradient method (as we show at the 
end of Subsection 5.31 turns out to be extremely efficient. This is perhaps not a completely new discovery, 
as benefits of preconditioning in IRLS have been reported already in minimization problems involving 
total variation terms [32]. The second significant outcome of our experiments is that CG-IRLS not only 
is faster than state of the art first order methods, but also shows higher recovery rates, i.e., requires 
less measurements for successful sparse recovery. This will be demonstrated with corresponding phase 
transition diagrams of empirical success rates (Figure]^. 


1.3 Outline of the paper 

The paper is organized as follows: In Section]^ we introduce definitions and notation and give a short 
review on the CG method. Although this brief introduction on CG retraces very well-known facts of the 
numerical linear algebra literature, it is necessary for us for the sake of a consistent presentation also in 
terms of notation. We hope that this small detour will help readers to access more easily the technical 
parts of the paper. In Section we present the IRLS method tailored to problems of the type Q and 
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its modification including CG for the solution of the quadratic optimizations. We present a detailed 
analysis of the convergence and rate of convergence. The approach is further extended to problems of 
type ([^ in Section]^ where we also analyze the convergence of the method. We conclude with numerical 
experiments in Section [^showing that the modifications to IRLS inspired by our theoretical results make 
the algorithm extremely efficient, also compared to state of the art first order methods, especially in high 
dimension. 


2 Definitions, Notation, and Conjugate Gradient method 

In this section, we introduce the main terms and notation used in this paper. In addition to this, we 
shortly review the basics around the Conjugate Gradient method. For a more detailed introduction to 
conjugate gradient methods, we refer to respective text books, e.g., [MUST]. In order to simplify cross¬ 
reading, we use the same notation as in [ 16 j . 

For matrices <I> € and y € C"*, we define 

^^(y) := {z € I <Pz = yj , (3) 

:=ker<P= {z €C^ I'Pz = 0} . ( 4 ) 

Unless noted otherwise, we denote with the adjoint (conjugate transpose) matrix of a matrix Thus, 
in the particular case of a scalar, x* denotes the complex conjugate of a: G C. 

Definition 1 (Weighted .^p-spaces) We define the quasi-Banach space (w) := , || • ||^p(iu)) en¬ 

dowed with the weighted quasi-norm 

/ N 

\i=l 

for a weight vector w G with positive entries and 0 < p < oo. Furthermore, we define the £^-spaces 
by setting := {!), where 1 denotes the weight with entries identically set to 1. Below we may ignore 

the superscript indicating the dimension N, when it is clear from the context, so that we write ip = ip 
or ip{w) = ip (w). The space i^ {w) is a Hilbert space endowed with the weighted scalar product 

N 

{x,y)e2iw) = ^x,y*Wi. 

i=l 

In the unweighted case ic = 1 it reduces to the standard complex scalar product (•, •)^ 2 - 
For <P G we define the norm 



:= sup \\<Px\\^^ , 

V 

and for the particular case of p = g = 2, ||^|| := is the standard operator norm and can be 

given explicitly by 

PI! = 

where Amax(’) denotes the largest eigenvalue of a square matrix (compare Definition [^. 

Definition 2 (K-sparse vector) A vector x G is called K-sparse for AT G N, AT ^ N, if the number 

7 ^ 0} of its non-zero entries does not exceed AT. 


Definition 3 (Nonincreasing rearrangement) The nonincreasing rearrangement r{x) of the vector 
X G is defined by r(x) := (|a;ij,..., |) with |a;q-1 ^ la^q+i I for g = 1,..., N — 1 and where j i—>■ ij 

is a permutation of {1,. .., N}. Furthermore, the best K-term approximation error aK{x)e in ir is given 

by 

N 




inf \\x 

, X-sparse 



X! lG(a^)r> 0 < r < oo. 

j=K-\-l 


4 




In this paper we restrict our attention to optimization problems of the type Q for matrices ^ S (j^mxAr 
for m ^ N having full rank, i.e., rank(^) = m, and certain spectral properties. Such matrices are used in 
the practice of compressed sensing and we refer to |20j for more details. The following notion has been 
introduced in [iniiiiiiiiiiiiisiiis]. 

Definition 4 (Null Space Property (NSP)) A matrix G satisfies the Null Space Property 

of order K for 7 ^ > 0 and hxed 0 < r ^ 1 if 

hT\\l<lKhT4l. ( 5 ) 

for all sets T C {l,...,iV} with < K and all p G ker^\{0}. We say in short that has the 
(if, 7 K)-NSP. 

We give an important consequence of the NSP mm. m Lemma 7.6]. 

Lemma 1 Assume that <P G satisfies the {K,^k)-NSP for 0 < t ^ 1. Then for any vectors 

z, z' G it holds 

11 ^' - • 

It follows immediately from this lemma that the solution x'^ of .^T--minimization Q run ony = <Px satisfies 
\\x^ — x\\4 ^ ^K{z)t-r- Another consequence is the following statement, see m Lemma 4.3] for 

the case r = 1 . 

Lemma 2 Assume thatT> has the {K,^k)-NSP ([^. Suppose that P^ijf) contains a K-sparse vector x*. 
Then this veetor is the unique i-r-minimizer in iF,p{y). Moreover we have for all v G 

\\v-x*\\^^<2^-^^aK{v)e^. ( 6 ) 

t — 7ic 

It is well-known that the NSP for 0 < r ^ 1 can be shown via the restricted isometry property HU 
[ 20 ] . but also direct proofs of the NSP are available for certain random matrices giving often better con¬ 
stants and working under weaker assumptions [8l ll7[[20ll281l33j . In particular, Gaussian random matrices 
satisfy the NSP of order K with high probability if m ^ CK\og{K/N). Structured random matrices 
including random partial Fourier and discrete cosine matrices, and partial random circulant matrices - 
both important in applications - satisfy the RIP and hence, the NSP with high probability provided that 
m ^ C'Ariog"‘(7V) [71l20LI30[l39lHn] . Note that for these types of structured matrices, fast matrix vector 
multiplication routines are available. 

Definition 5 (Set of eigenvalues and singular values) We denote with A{A) the set of eigenvalues 
of a square matrix A. Respectively, Amin (A) and Amax(A) are the smallest and largest eigenvalues. We 
define by o'min(A) and crmax(A) the smallest and largest singular value of a rectangular matrix A. 


2.1 Conjugate gradient method (CG) 

The CG method was originally proposed by Stiefel and Hestenes in [24] and generalized to complex 
systems in m- For an Hermitian and positive definite matrix A € solves the 

linear equation Ax = y or equivalently the minimization problem 

argmin (F{x) := -x*Ax — x*yj . 
xgC™ V 2 J 

The algorithm is designed to iteratively compute the minimizer x* of F on the affine subspace Vi := x'^+Vi 
with Vi being the Krylov subspace Vi := span{r°, Ar°,..., C C^, x° G a starting vector, 

and r^ := y — Ax^ {minimality property of CG). 
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Algorithm 1 Conjugate Gradient (CG) method 

Input: initial vector x'^ £ C^, matrix A G given vector y G and optionally a desired accuracy 

< 5 . 

1: Set = y — Ax^ and i = 0 

2: while r* yl: 0 (or Ur'll^ > S) do 

3 : ai = {r\f)i^l{Ap\p'^)t^ 

4: a;®+l = x* + Uip* 

5: = y — j4x®+^ 

6: bi+i = {Ap®,r®+l)^2/{Ap®,p®)^2 

7: p®+l = r®+l - 6i+ip® 

8: i = i + \ 

9: end while 


Roughly speaking, CG iteratively searches for a minimum of the functional F along conjugate direc- 
tions p‘ with respect to A, i.e., [p'')*AjP = 0, j < i. Thus, in step i + 1 of GG the new iterate is 
found by minimizing Fipf + atp^) with respect to the scalar at G M. along the search direction p®. Since 
we perform a minimization in each iteration, this implies monotonicity of the iterates, ^ F{x^). 

If the algorithm produces at some iteration a residual F = 0, then a solution of the linear system is 
found. Otherwise it produces a new conjugate direction p®. One can show that the conjugate directions 
p°,... ,p®“^ also span Vi. Since the conjugate directions are linear independent, we have Vn = (as¬ 
sumed that r® 7 ^ 0, i = 0,..., N — 1). Then, according to the above mentioned minimality property, the 
iterate x^ is the minimizer of F on C^, which means that GG terminates after at most N iterations. 
Nevertheless, the algorithm can be stopped after a signihcantly smaller number of steps as soon as the 
machine precision is very high and theoretically convergence already occurred. In view of propagation of 
errors in practice the algorithm may be run longer than just N iterations though. 

The following theorem establishes the convergence and the convergence rate of GG. 


Theorem 1 f |37L Theorem 4.12]) Let the matrix A be Hermitian and positive definite. The Algo¬ 
rithm CG eonverges to the solution of the system Ax = y after at most N steps. Moreover, the error 
x'' — X is such that 


A^x^-x) 




1 + c2® 


A2{x^-x) 


t2 


with CA 


y/iCA - 1 
\/VA + 1 


< 1 , 


where ka = is the condition number of the matrix A and crmax(^) (resp. (Tinin(^)^ is the largest 

(resp. smallest) singular value of A. 


Remark 1 Theorem is slightly modihed with respect to the formulation in [37) . There, the matrix A is 
considered to be symmetric instead of being Hermitian. However, in the complex case, the proof can be 
performed similarly by replacing the transpose by the conjugate transpose. 


Remark 2 Since ka ^ 1, it follows that 0 ^ ca < 1, and also 0 ^ < 1, for positive iteration numbers 

i. From 0 < (1 — c\)'^ = 1 -I- c^® — 2c\, we immediately see that 2c)4/(l -I- c^®) < 1 for all i S N, and 
obviously -I- c^®) —^ 0 for z —>• -|-oo. 


2.2 Modified conjugate gradient method (MCG) 

In Section we are interested in a vector which solves the weighted least-squares problem 

X = argmin ||a;||^ 2 (»«)> 


given (L G with m ^ A. As we show below in Section 3.1 the minimizer x is given explicitly by 

the (weighted) Moore-Penrose pseudo-inverse 

X = D<P*(<PD<P*)-^y, 
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_l N 

where D := diag [w^ Hence, in order to determine i, we first solve the system 

<PD^*e = y, 

and then we compute x = D<1>*9. Notice that the system Q has the general form 

TT*e = y, (8) 

with T := (1>D^. We consider the application of CG to this system for the matrix A = TT*. This approach 
leads to the modified conjugate gradient (MCG) method, presented in Algorithm and proposed by 
J.T. King in [^H]- It provides a sequence with 0® e Ui := span{?/, TT* 2 /,..., the 

Krylov subspace associated to with the property that a;® := T*9^ minimizes ||x® —where 

X = argmin ||a;|||^. Finally, we compute x = D^x. 


Algorithm 2 Modified conjugate gradient (MCG) method 

Input: initial vector 9^ S C"*, T g y G C™, desired accuracy <5 (optional). 

1: Set = 3 / and i = 0 

2: while p® 0 (or ||p®||£ > S) do 

3: ai = (pbA}t,/||rV||2^ 

4: = 6»® + aip* 

5: p®+i = 

6: ft+i = (T*p®,r*p®+l},,/||r*p®|||^ 

7: p®+l = p»+l - ft+ip® 

8: i = i 1 

9: end while 

10: Set S®+1 = r*6»®+l 


The following theorem provides a precise rate of convergence of MCG. Addit ion ally, we emphasize 

the monotonic decrease of the error \\Sf — i: L . ,, which we use below in Lemma 11 

II ll^alJc)’ [_J 

Theorem 2 Suppose the matrix T to he surjective. Then the sequence (ai®)igN generated by the Algo¬ 
rithm MCG converges to x = T* (TT*)~^y in at most N steps, and 


9 c® 

1-7 - 11 ^ 

-^IIa ^ 1 + c^® 


|a;^ — , with ctt* < 1, 


(9) 


B 


for all i ^ 0, where ctt*= = a-max(T) <rmin(T) ^ defined as in TheoremlA and x^ = T*9^ 

■' ^ ^ ^ yG(TT^+l <rmax(T)+o'„i„(T) -I I f 

is the initial vector. Moreover, by setting D ■= diag and xS = D^x'' as well as x = D^x, we 

obtain 

9c® 

^CTT* II-0 _ - II 




(w) ^ 1 _|_ 


'TT* 


Proof By Theorem we have 

{TT*)^{9^ -9) 
for 9 as given in (1^. By the identity 


t2 i- “T 


^2 


( 10 ) 


(rr*)5(6i® -9) = ((TT*)5(6I® - 9), (TT*)5(6»® - 9))e, = {{TT*){9^ - 9), 9^ - 9)i^ 

^2 

= {T*{9^-9),T*{9^-9))^, = {x^-x,x^-x)i, = \\x^ - x\\l, 

we obtain the assertion Inequality ( [l0| ) follows then from the definition of the diagonal matrix D 
and the weighted norm £ 2 (it)- The fact that the coefficient 2cy-j-,/(l + < 1 for all i € N, and 

2c\,j,,/{l + c^rp,) —1 0 for j —i 00 follows as in Remark]^ 
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3 Conjugate gradient acceleration of the IRLS method for fT--minimization 

In this section, we start with a detailed introduction of the IRLS algorithm and its modified version 
that uses CG for the solution of the successive quadratic optimization problems. Afterwards, we present 
two results providing the convergence and the rate of convergence of the modified algorithm. As crucial 
feature, we give bounds on the accuracies of the (inexact) CG solutions of the intermediate least squares 
problems which ensure convergence of the overall IRLS methods. In particular, these tolerances must 
depend on the current iteration and should tend to zero with increasing iteration count. In fact, without 
this condition, one may observe divergence of the method. The proofs of the theorems are developed into 
several lemmas. 

From now on, we consider a fixed parameter r such that 0 < r ^ 1. At some points of the presentation, 
we explicitly switch to the case of r = 1 to prove additional properties of the algorithm which are due 
to the convexity of the fi-norm minimization problem. 


3.1 Iteratively Re-weighted Least Squares (IRLS) algorithm for ^^--minimization 

The following functional turns out to be a crucial tool for the analysis of the IRLS algorithm and its 
modified variant. 


Definition 6 Given a real number £ > 0, x G , and a weight vector w G with positive entries 
Wj > 0, j = 1,... ,N, we define 


Jr {x,W,£) 


T 

2 


N 

i=i 




( 11 ) 


The standard IRLS algorithm for ^T-minimiziation is intuitively motivated in |16j by means of a 
weighted least squares approximation of the -minimization problem. However, for the sake of a concise 
presentation, we introduce below the algorithm directly as an alternating minimization procedure of the 
functional Jr with respect to the three variables x, w, and e. Algorithm recalls the formulation of 
IRLS, as appearing in [Tin Section 7.2], or [5^1 Ghapter 15.3]. For the sake of notational clarity, we use 
the nonincreasing rearrangement r(-), as introduced in Definitionat step 3. 


Algorithm 3 Iteratively Re-weighted Least Squares (IRLS) 

Set 

10° 1), £° 

:= 1 


1: 

while s’* ^ 0 do 



2: 

^n + 1 ^j,g 

Jr(x,w",e") = argmin 



xeF.l,(y) 

xGJF^(y) 


3: 

e"+l := min(£", 

N ^ 


4: 

,^n+i argmin 

Jr(x"+\w,e"+^), i.e., w"+^ = [|a:"+l|2 + (e"+l)2]-^, 

II 


tt;>0 

J J 


5: 

end while 




The convergence of IRLS is by now well-established and we refer to [16] and [20l Section 15.3] for 
details, which we in part extend in our analysis in Section |3.3| 

In this section we propose a practical method to solve approximatively the least squares problems 
appearing in step of Algorithm The following characterization of their solution turns out to be very 
useful. Note that the ^ 2 ('a')-iiorm is strictly convex, therefore its minimizer subject to an affine constraint 
is unique. 


Lemma 3 ( |16L (2.6)], |20l Proposition A.23]) 

J7f(y) and 


We have x = argmin if and only if x G 

xe^F^iy) 


{x, rfj^ = 0 for all rj G Af,p. 


( 12 ) 
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By means of Lemma we are able to derive an explicit representation of the weighted ^ 2 -ininimizer 


X := argmm 


\^\\i 2 {w)- Define D := diag [{wj) ( |12[ ), we have the equivalent formulation 


D-^x e 7^(^>*), 


where Tl{-) denotes the range of a linear map. Therefore, there is a ^ S such that x = To 

compute we observe that 

y = ^x= 

and thus, since <P has full rank and is invertible, we conclude 

X = = D^*{<PD^*)-^y. 

As a consequence, we see that at stepj^of Algorithm IRLS the minimizer of the least squares problem 
is explicitly given by the equation 

x^+^ =Dr,<P*{<PD^<P*r"y, (13) 


where we introduced the N x N diagonal matrix 

:= diag • 

Furthermore, the new weight vector in step of Algorithm IRLS is explicitly given by 

w;+^ = [\x]+^\^ + {e-+^r]-^-^, j = l,...,N. (14) 

Taking into consideration that Wj > 0, this formula can be derived from the first order optimality con¬ 
dition 9j7r(x"+^, w, e"'''"^)/9w = 0. 


3.2 The algorithm CG-IRLS 

Instead of solving exactly the system of linear equations (13) occurring in step of algorithm IRLS, 
we substitute the exact solution by the approximate solution provided by the iterative algorithm MCG 
described in Section 2.2 We shall set a tolerance tol„+i, which gives us an upper threshold for the 
error between the optimal and the approximate solution in the weighted £ 2 “iiorm. In this section, we 
give a precise and implementable condition on the sequence (toln)„gN of the tolerances that guarantees 
convergence of the modified IRLS presented as Algorithm]^ below. 


Algorithm 4 Iteratively Re-weighted Least Squares combined with CG (GG-IRLS) 

Set 1), := 1, /3 S (0,1] 

1: while e" ^ 0 do 

2: Compute by means of MCG s.t. ||x"+l — < tol„+i, where := argmin e") = 

argmin ||^|b (w'"-) - Use the last iterate 0"’* corresponding to x" = from MCG of the previous IRLS iteration 

as initial vector 6® = for the present run of MCG. 

3: £’"+1 := min(£",,3r(i"+l)K+i) 

4: := argmin , w, £”+^), i.e., -|- 3^ , j = I,..., N 

w>0 ^ ^ 

5: end while 


In contrast to Algorithm IRLS, the value (3 in step is introduced to obtain flexibility in tuning 
the performance of the algorithm. While we prove in Theorem convergence for any positive value of 

j3, Theorem 3 iii) below guarantees instance optimality only for /3 < ^ j ^ in the case that 

lim £"■ ^ 0. Tvfevertheless in practice, choices of /3 which do not necessarily fulfill this condition may 

n—yoo 

work very well. Section investigates good choices of /3 numerically. 
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From now on, we fix the notation for the exact solution in stepj^of Algorithmic and for 

its approximate solution in the i-th iteration of Algorithm MCG. We have to make sure that — 

~n+i,j ||2 is sufficiently small to fall below the given tolerance. To this end, we could use the bound 




on the error provided by (10), but this has the following two unpractical drawbacks: 

(i) The vector x = 4"“'"^ is not known a priori; 

(ii) The computation of the condition number ctt* is possible, but it requires the computation of eigen¬ 
values with additional computational cost which we prefer to avoid. 

Hence, we propose an alternative estimate of the error in order to guarantee ||£ 2 (u,r.) < 

tol„+i. We use the notation of Algorithm MCG, but add an additional upper index for the outer IRLS 
iteration, e.g., is the 0* in the n + 1-th IRLS iteration. After i steps of MCG, we have 

We use 6»”+b* = - p"+b*) from stepjCof MGG to obtain 


max I \xA^ -|- 


I n+l .^||2 < ^ 




n+l,i II2 


(<?)' 


The last inequality above results from Amin = cr^m {^Dn^ and 

CTmin ^CTmin {'^) Cmin ^ (£”)^“^0-min (^) ■ 

Since e” and i" are known from the previous iteration, and ||p"^^’*||f 2 i® explicitly calculated within 
the MCG algorithm, achieved by iterating until 




m 


l+maxf^)') ||<l>||2 


-tol. 


n+1 ■ 


(15) 


Gonsequently, we shall use the minimal i S N such that the above inequality is valid and set 5"“'"^ := 
^n+i,i^ which will be the standard notation for the approximate solution. 


In inequality (15), the computation of Umin (^) and ||<?|| is necessary. The computation of these con¬ 
stants might be demanding, but has to be performed only once before the algorithm starts. Furthermore, 
in practice it is sufficient to compute approximations of these values and therefore these operations are 
not critical for the computation time of the algorithm. 


3.3 Gonvergence results 


After introducing Algorithm CG-IRLS, we state below the two main results of this section. Theorem 
shows the convergence of the algorithm to a limit point that obeys certain error guarantees with respect 
to the solution of ([^. Below K denotes the index used in the e-update rule, i.e., stepj^ of Algorithm GG- 
IRLS. 

Theorem 3 Let 0 < r ^ 1. Assume K is such that L' satisfies the Null Space Property ([^ of order K, 
with 7 < 1. //tol„+i in Algorithm CG-IRLS is chosen such that 


\/tol„+i < 



Cn 

T’ 


(16) 
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where 


Cn ■■= 2Wn (||^”||^ 2 („r.-i) + Vtol„j , with 


Wn ■■ = 




max i" ^ 2“"^ + (£”“1)2—^ 

1 1 

(£„)2-r ’ ■- 

D 2 r)2 


(17) 

(18) 


for a sequence (a„)„gN, which fulfills ^ 0 for all n S N, and On < oo, then, for each y S C™, 

2—0 

Algorithm CG-IRLS produces a non-empty set of accumulation points ZT^y). Define e := lim e”, then 

n—>-oo 

the following holds: 

(i) If e = 0, then Zr{y) consists of a single K-sparse vector x, which is the unique ir-minimizer in 
J-,p{y). Moreover, we have for any x S J-${y): 

,1 + 7 


I 2 ; — a;||J <ciaK{x)t^, withci'.= 2 


1 - 7 ' 


(19) 


(ii) If £ > 0, then for each x G Zr{y) ^ 0, we have {x,ri),j^^^ ^ = 0 for all rj G Af^,, where w(x,£,t) = 


2 -x-,Af 

= .|2 , ^ 2 | 2 


\Xi 


i=l 


. Moreover, in the case of t = 1, x is the single element of Zr{y) and x = 


N 


:= argmin ^ |a;| +e ^|2 (compare (|42|J. 

x£F^(y) j=l 


N 


(Hi) Denote by Xg^riv) the set of global minimizers of f,;^r{x) := ^ \Xj +£^|2 on J-,p{y). If £ > 0 and 


j,2 ^2 \ 

1=1 

( 1-7 k+i-G 


X G Zriy) n Xg^r{y), then for each x G IF${y) and any fi < , we have 

1 + 7 / 2 


Af/3^ 


£ < C 2 crk{x)i^, with C 2 := p- — 

" 1-7 \ 1 


K+l-k 

1+7 

K+l-fe 1-7 , 


Remark 3 Notice that ( |16[ ) is an implicit bound on tol„+i since it depends on which means that in 
practice this value has to be updated in the MCG loop of the algorithm. To be precise, after the update of 
^n+i,j+i g^gp ^ Algorithmj^we compute j"+i.®+i = rp*Qn+i,t+i each iteration i of the MCG loop. 

If i"+i++i is TC-sparse for some iteration i, then = £"+i++i — min je",/3r = 0 

and tol„+i = 0 by ( [T7| ) and (I 8 I. In this case, MCG and IRLS are stopped by definition. The usage of 
this implicit bound is not efficient in practice since the computation of r(i"'“''^’*'''^)/f+i requires a sorting 
of N elements in each iteration of the MCG loop. While the implicit rule is required for the convergence 
analysis of the algorithm, we demonstrate in Section [5.2| that an explicit rule is sufficient for convergence 
in practice, and more efficient in terms of computational time. 


Knowing that the algorithm converges and leads to an adequate solution, one is also interested in how 
fast one approaches this solution. Theorem states that a linear rate of convergence can be established 
in the case of r = 1. In the case of 0 < r < 1 this rate is even asymptotically super-linear. 


Theorem 4 Assume <1 satisfies the NSP of order K with constant 7 such that 0 < 7 < 1 — , and that 

J-${y) contains a k-sparse vector x*. Define A := supp(a;*). Suppose that k < K — and 0 < < 1 

are such that 


fi := 


R* := I i/min \x*j 
jeA 


7 ( 1 + 7 ) 


(1 — vfiG-D ^min 


r(l-T) 


1 + 


{N-k)fi^ 
K + l-k 


< 1 , 


fi{R*Y-^ < 1 , 


( 20 ) 
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for some fi satisfying fx < fi < 1. Define the error 

E^:= 

Assume there exists Uq such that 

Eno < R*- 

If On+i and toln+i are chosen as in Theorem^with the additional bound 


toL 


•n+1 


< 


then for all n ^ no, we have 
and 


(NC)-^ J 

En+l < h-En ^ + {NCy 2 (tol„+i) 2 . 


E, 


n+l 


< flE^ 


2-r 
n ? 


( 21 ) 

( 22 ) 

(23) 

(24) 

(25) 


where C ■= 3 anJ t Consequently, 3C converges linearly to x* in the case of t = 1. 

n—1 

The convergence is super-linear in the case o/O < r < 1. 


Remark 4 Note that the second bound in (23), which implies (25), is only of theoretical nature. Since 


the value of En is unknown it cannot be computed in an implementation. However, heuristic choices of 
tol„+i may fulfill this bound. Thus, in practice one can only guarantee the “asymptotic” (super-)linear 
convergence (24). 


In the remainder of this section we aim to prove both results by means of some technical lemmas 


which are reported in Section 3.3.1 and Section 3.3.2 


3.3.1 Preliminary results coneerning the functional Jr{x,w,e) 

One important issue in the investigation of the dynamics of Algorithm CG-IRLS is the relationship 
between the weighted norm of an iterate and the weighted norm of its predecessor. In the following 
lemma, we present some helpful estimates. 

Lemma 4 Let i", , i", and the respective tolerances tol„ and tol„+i as defined in Algo¬ 

rithm CG-IRLS. Then the inequalities 


+ l I 




I 




< \/toI 


n+l) 


and 


o.n+1 I 




< Wn 


2(w" 1 ) 


-i\ + V to 


(26) 

(27) 


hold for all n ^ 1, where Wn '■= 




n—1 


Proof Inequality (26) is a direct consequence of the triangle inequality for norms and the property that 


l^n+i _ ~n+i||^ ^ ^ y^tol„+i of step 2 in Algorithm CG-IRLS. 




In order to prove inequality (27), we first notice that i", S Using that is the minimizer 


of IMI^ 2 (n)") ■^‘f(y)) ’^0 obtain 


0.71+1 I 




^ Hi" 




< 


_ 1 1 
^n-1 


Dn^X^ 
_ 1 

n 2 -n 
^n-1^ 


r) 2 r)2 r) 2 -T 


= Wn Hi" 


fin ( Hi"H^2(n;"-i) + V toln ) , 


where the last inequality is due to (26). 


The functional J'r{x,w,£) obeys the following monotonicity property. 
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Lemma 5 The inequalities 


Jr < Jr {x'‘' TW 


~n+l 


,e”+^) < Jr (i- ' TW 


-n+l 




(28) 


hold for all n > 0. 

Proof The first inequality follows from the minimization property of The second inequality follows 

from < e". 


The following lemma describes how the difference of the functional, evaluated in the exact and the 
approximated solution can be controlled by a positive scalar a„+i and an appropriately chosen tolerance 
tol^i+i. 


Lemma 6 Let a„+i be a positive scalar, 

and 5;"+^ = argmin Jr . If we choose tol„ 


, and as described in 

as in ( 161 , then 


Algorithm CG-IRLS, 


<a„+i, (29) 

IJV e") — Jt w", e") I < a„+i, and (30) 

J,(x"+\u;”+\e"+i) <X (x"+\+2a„+i. (31) 

Proof The core of this proof is to find a bound on the quotient of the weights from one iteration step 
to the next and then to use the bound of the difference between x"'*'^ and x"'*'^ in the £ 2 (w”)-norm by 
tol„+i. Starting with the definition of Wn+i in Lemma the quotient of two successive weights can be 
estimated by 


Wn+l = 


< 


D ^ 




= -t / max 


Wp 


max 


c”p + (e”)^) ^ 


y-'' {lij+y+ (£"+>)2)' 




max + (e") 2 -'^ 

1=1,...,N ^ 


= W„+i, 


(e"+i) 2 -^ 

where lT„+i was defined in (18). By choosing tol„+i as in ([T^, we obtain 


jJr (x"+\r(;"+\£”+i) - Jr (i”+\w;"+\£"+i) 
N 




N 


1E(I 




(32) 


N 


N 




^n+l -n+ll-^ n+1 


^||x”-^^l + |x;‘-'*ii w. 


'n+li I |;;.n+l|| 


vi=i 


^ 1=1 


r W, 

max 


»+l / N 


‘ 2 e=i,...,N w'l 


E 

0=1 




N 

E 

vi=i 


|x-+i| + |x-+i||^< 


n +1 11-^ Ih 2 (u)")lll^ I + |x + 


llx"+i-x"+i 




y/toln+i (| 


I 


1 ^ 2(11 


I 


1 ^ 2 ( 1 ^’") 


^ 2 ^"+i 


AWn f\/tol„') + i/tolji+i 


Cn “f \/tolyi+l 


^ <^n+l 5 
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where we have used the Cauchy-Schwarz inequality in the first inequality, (261 and (27) in the fifth 
inequality, (32) in the third inequality, the definition of c„ in ( |17[ ), and the Assumption (16) on tol„+i 
in the last inequality. 


Since 1 ^ Wn+i, we obtain (30) by 


I - Jr (x'‘' % | = 


p.71+1 „,,n n\ 


N 




i=i 


N 






N 

El 






n+1 


N 

E 

vt=i 




^ -bb^+iv^toln+i Cn + x/iol n+1 
with the same arguments as above. Lemma yields 


N 

.S' 

^ ^n+ 1 ? 


1 + 1 + 


+ +”+\u;"+i,e"+i) < + (, 


,n+l n+1 _n+l 


X ,W 

w 


e”+i) + a„+i < + (i"+\ re", e"+i) + , 


'-n+l 


< Jr w”,e”) + a„+i < Jr (i"+\t(;”,e”) + 2a„+i, 


where the first inequality follows from (|29[), the second and third by (28), and the last by (30). 


In the above lemma, we showed that the error of the evaluations of the functional Jr on the ap¬ 
proximate solution i" and the weighted ^ 2 -ininimizer x" can be bounded by choosing an appropriate 
tolerance in the algorithm. This result will be used to show that the difference between the iterates 
and i” becomes arbitrarily small for n —>■ oo, as long as we choose the sequence (a„)rtGN summable. This 
will be the main result of this section. Before, we prove some further auxiliary statements concerning the 
functional Jr(x,u',e) and the iterates i” and w”. 


Lemma 7 Let (a„)„gN 7 On € R-i-) be a summable sequenee with A := an < oo, and define C := 3A + 


Jr as in Theorem^^ For each n>l we have 


N 


Jrii^+\w^+\e^+^) =J 2 {\ 


^n+l|2^(^n+l)2^2 


i=i 

^”111 < + 

,,n 


wf > C ^ = and 


+IL. llxl 


e2iw^) allxGC 


N 


(33) 

(34) 

(35) 

(36) 


Proof Identity (33) follows by insertion of the definition in step 4 of Algorithm CG-IRLS. 

By the minimizing property and the fact that x" £ ^<p{y), we have 

J+x"+\ u;", e+ < J+x”, re”, e”), 


and thus, together with (31), it follows that 


Jr (x"+\ u;"+\ e"+i) < Jr (x"+\ u;", e") + 2a„+i < Jr (x", , e") + 2an+i. 


Hence, the telescoping sum 

n 

(+ - Jr (+,+,£' 




)) ^ ^ ^ ^ Q^fc+1 
fc=l 
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leads to the estimate 


Jr < Jr (a;^, +2A^ Jr [x^ , w° , e°) +2A + ai. 


Inequality (34) then follows from (|29[) and 


N 


^ = =Jr{x^+\w^+\e^+^) 

i=i 

^ Jr + a„+i < C, for all n^l. 


Consequently, the bound (35) follows from 

2-t 


«) < 


< Jr < C. 


Inequality (36 1 is a direct consequence of (351. 


Notice that (34) states the boundedness of the iterates. The lower bound (35) on the weights w" will 


become useful in the proof of Lemma 

By using the estimates collected so far, we can adapt [Ml Lemma 5.1] to our situation. First, we shall 
prove that the differences between the n-th t! 2 (ac"~^)-minimizer and its successor become arbitrarily 
small. 


Lemma 8 Given a summable sequence (a„)„gN: o-n G K+, the sequence satisfies 

2 


E 

n—l 


\X — X 


< -cj 


(37) 


where C is the constant of Lemma^^and x" = argmin Jr {x,w'^ ^). 4s a consequence we have 


xeJ^4,{y) 

lim ||x"-x"+M 


= 0 . 


(38) 


Proof We have 


[Jr {x^, w^, e”) - Jr (x”+\ e"+i) + 2a„+i] 


> - [Jr{x^,w^,e^)-Jr{^ 


= (x",xX^ - (x"+\= (x” + 


-n+l »n _ 'Ti+1\ 

tAJ « tJj j 


N 


= (x" - x"+\ x” - 


«;”|x”-x”+ip > C- 


|x” -x"+i| 


1^2 ■ 


2=1 


Here we used the fact that x" — x”’*'^ S and therefore, (x”"*"^, x" — x"’*'^) = 0 and in the last step we 
applied the bound (36). Summing these inequalities over n > 1, we arrive at 


N 

El 

n—l 


^n+11 


N „ 
.2-t ^ Z 


< Y. - {x^^\w^+\£^Y + 2a„+i] 


n—l 


< -c- 

T 


N 


Jr (x^tc^e^) + Y 2an+i 


T 


Letting TV —> oo yields the desired result. 

The following lemma will play a major role in our proof of convergence since it shows that not only 


(38) holds but that also the difference between successive iterates becomes arbitrarily small. 


Lemma 9 Let x" be as described in Algorithm CG-IRLS and (an)n 6 N be a summable sequence. Then 

(39) 


lim ||x"-x"+i, = 0 . 

n—^oo " "^2 
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Proof By (36) of Lemma and the condition (16) on tol„, we have 


||i" - P" - < 6-=^ Vtol„ [-f + yi^y 

^ [2 , _ 

since Wn ^ 1 as defined in LemmaSince (a„)„gN is summable, we conclude that 

lim =0. 


I 2 ci^ 


(40) 


Together with Lemmawe can prove our statement: 


"*2 n^oo " "*2 

^ lim ||i"-i"|L + lim ||i”-i”+ML + lim 


n—foo 

= 0 , 


*2 n—foo ' 


where the first and last term vanish because of (40) and the other term due to (38). 


3.3.2 The functional fe.r^z) 

In this section, we introduce an auxiliary functional which is useful for the proof of convergence. From 
the monotonicity of e„, we know that e = lim exists and is nonnegative. We introduce the functional 

n—^oo 


N 


fe.rix) := \x] + 


e^li. 


(41) 


Note that if we would know that i" converges to x, then in view of (33), fg^rix) would be the limit 


of J7r(i", w",e”). When e > 0, the Hessian is given by H{fs^r){x) = diag 


x]{T-l)+e^ 
T 4_t 


N 


. Thus, in 


Ji=i 

particular, H{fg^i){x) is strictly positive definite, so that Pp is strictly convex and therefore has a unique 
minimizer 


£,1 ._ 


:= argmin /e,i(x). 
xeJ^0{y) 


(42) 


In the case of 0 < r < 1, we denote by Xg^riu) the set of global minimizers of f^^r on iF<p{y). For both 
cases, the minimizers are characterized by the following lemma. 

Lemma 10 Let e > 0 and x G iF<i-{y). If x = x^I or x € X^^riy), then ^ = 0 for all y G 

where w{x, e, t) = 




N 


i=l 


. In the case of t = 1 also the converse is true. 


Proof The proof is an adaptation of na Lemma 5.2, Section 7] and is presented for the sake of com¬ 
pleteness in Appendix [S] 


3.3.3 Proof of convergence 

By the results of the previous section, we are able now to prove the convergence of Algorithm CG-IRLS. 
The proof is inspired by the ones of m Theorem 5.3, Theorem 7.7], see also [201 Chapter 15.3], which 
we adapted to our case. 
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Proof (Proof of Theorem^ Since 0 < < e" the sequence (e"')„gN always converges to some £ > 0. 

Case £ = 0: Following the first part of the proof of [M Theorems 5.3 and 7.7], where the bound¬ 
edness of the sequence i" and the definition of £" is used, we can show that there is a subsequence 
(a;P^)p^gN of (a:")„gN such that x G J-<p{y) and x is the unique .^^--minimizer. It remains to show 

that also i" —>■ x. To this end, we first notice that x^^ -G x and -G 0 imply Jr (x^pw^^eP^) —>■ ||ai||J^. 
The convergence of Jr (a;”, ru”, £") —)■ \\x\\J^ is established by the following argument: For each n S N 
there is exactly one i = i{n) such that pi < n ^ Pi+i- We use (31) and (29) to estimate the telescoping 


sum 


n—1 


\Jr (i", W^, £") - Jr , ZCJ’X") , ) K ^ | J. {i^+\w’^+\8^+^) - Jr (^^ S^) \ 


k—pi 

n—1 

^4 ^ Ofc+i. 

fe=Pi(„) 


Since < 0° ^^is implies that lim„_>oo \ Jt (i"',w",£") — Jr (i^>('*),r(;P*("),£P*("))| = 0 so that 

lim X(i", «;”,£") = \\x\\f . 

n—^oo 


Moreover (33) implies 


J,(i",u;”,£")-iV(£")"< < J,(i",rc",£”), 

and thus, ||i"||J —>■ \\x\\f . Finally we invoke Lemmaj^with z' = and z = a: to obtain 

= 0 , 


lim sup Hi" — x\\f < I ( lim 

n-foo ^ 1 ~ 7 


which completes the proof of i" —>■ i in this case. To see (19) and establish (i), invoke Lemma 

Case £ > 0: By Lemmaj^ we know that (i")„gN is a bounded sequence and hence has accumulation 
points. Let (i”‘) be any convergent subsequence of (i”)„gN and let x G Zr{y) its limit. By ( [ioj ), we 
know that also x G Following the proof of [IH] Theorem 5.3 and Theorem 7.7], one shows that 

(i, r])^(^r: e t) ~ ^ 7 ^ where w{x, £, r) is defined as in Lemma 


10 


In the case of r = 1, Lemma 10 implies x = x^’^. Hence, a;®4 is unique accumulation point of 
(i")„gN- This establishes (ii). 

To prove (hi), assume that x G Zr{y) (7 Xe,r{y)i and follow the proof of [HI Theorem 5.3, and 7.7] to 
conclude. 


3.3.4 Proof of rate of convergence 

The proof follows similar steps as in mi Section 6]. We define the auxiliary sequences of error vectors 
fy" := x" - X* and f)" := i” - x*. 


Proof (Proof of Theorem^ We apply the characterization (12) with w = ic", x = = x* + 

and 77 = — x* = which gives 


N 


E(7+'ir')ir‘»i =»■ 

Rearranging the terms and using the fact that x* is supported on R, we obtain 


N 


N 


|f7”+T< = 

1=1 1=1 j&A 




^ 7 ?"+^ 

^pn |2 + (£„) 2 ]V ^ 


By assumption there exists uq such that En^ ^ R*. We prove (24), and En ^ R* ^ En+i ^ R* to 


obtain the validity for all n ^ uq. Assuming En ^ i?*, we have for all j G A, 

\Vj \ < Wv'^hr = \/^ < 


(43) 
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and thus 


so that 


|S”| = \x* + ^\x*\- |? 7 "| ^\x*\- iy\x*l 


'r.n\2 


< 


< 


{l-y) 


2—r I O’* 11 —T ' 


( 44 ) 


Hence, (43) combined with (44) and the NSP leads to 


N 


vi=i 




j liar'll?, 




n+liir 


(1 — j 1 ^;,* 


( 1 - 1 -) lUx 


Combining [TOl Proposition 7.4] with the above estimate yields 


WAt^\\l = 




. i^A<= 


€ 


n+1 


N 


|2r 

\i2i'w^) 

2-r 


K”)- 




i 2r (W^) 
2-t 


s'Eir’is" Efe-i+c' 

ii=i / 


^_ 2 ^_ 

(1 — r) ^min|a:*|^ 


{WWl + {N-k) . (45) 


r(l-T) 


It follows that 

II^A^'lll < - Y -r7(T37) (ll^”lll ^{N-k) (e-yf-^ . 

(1 — ( min lx* I ) 

Vie^ ■' J 

Note that this is also valid if ff)y^ = 0 since then the left-hand side is zero and the right-hand side 
non-negative. We furthermore obtain 


wr^^wi = ii«+ii + < y+ywAt^wi 

/ 7 ( 1 + 7 ) /, 


r(r-l) 


(1 — i/yC^-'^') ^min|x*|^ 

In addition to this, we know by [161 Lemma 4.1, 7.5], that 

{J-j)r{xyj < \\x-x'\\y +aj{x')g^. 


l + iN-k) is-yy 


(46) 


(47) 


for any J > j and x,x' £ . Thus, we have by the definition of e" in step of Algorithm CG-IRLS 

that 


{N - k){e^y < (TV - kw {rixyK+iy < y, - ^111 + < 7 k{x*yy 


K + l-k 


{N-k)j3'^ 

K+l-k 


\mi 


(48) 
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since by assumption ak{x*)g^ = 0. Together with (46) this yields 




7(1 + 7 ) 


(1 — j Ij.* 

\jGA -5 


r(l-r) 


1 + 


{N - k)l3'^ 
K + l-k 


2-T 


11^7' 


ra||T(2-T) 




Finally, we obtain (24) by 

F;„+i = < ||r?"+i| 

< \\r^% +{NC) 


11^^ n- ||i"+i -x”+i| 

1-5 _ ^.’^+1 






,n+l I 


+ ||i"+" -i 


,n+l -n+l| 




where we used the triangle inequality in the first inequality, (36) in the third inequality, and C is the 
constant from Lemma 0 

Equation (|2^ then follows by condition (23). By means of (20), we obtain 


E^+i < < /i < R\ 


and therefore the linear convergence for r = 1 , and the super-linear convergence for r < 1 as soon as 
n ^ no. 


4 Conjugate gradient acceleration of IRLS method for fT--norm regularization 

In the previous chapter the solution x* was intended to solve the linear system ^x = y exactly. In most 
engineering and physical applications such a setting may not be required since the measurements are 
perturbed by noise. In this context, it is more appropriate to work with a functional that balances the 
residual error in the linear system with an fx-norm penalty, promoting sparsity. We consider the problem 

mjn (ftAx) ■= \\x\A + 


where A > 0, y € C™ is a given measurement vector, and 0 < r ^ 1 . 


Definition 7 Given a real number e > 0, x S C”, and a weight vector w € K^, rc > 0, we define 

N r 


JrAx,W,s) ■= 2^2 
1 = 1 


\xj\ Wj +e Wj + 


2-t 




(50) 


Lai, Xu, and Yin in m and Daubechies and Voronin in gam] showed independently that computing 
the optimizer of the problem (49) can be approached by an alternating minimization of the functional 
Jt.\ with respect to x, w, and e. The difference between these two works is the definition of the update 
rule for e. Here, we chose the rule in step of Algorithm [^proposed by Daubechies and Voronin because 
it allows us to show that the algorithm converges to a minimizer of (|4^ for r = 1 and to critical points of 


(49) for T < 1 (more precise statements will be given below). However, we were not able to prove similar 
statements for the rule of Lai, Xu, and Yin. It only allows to show the convergence of the algorithm to 
a critical point of the smoothed functional 


min 

X 


X 


II^T 



y\\ 


2 

A’ 


N 

where ||x||J ^ ■= with e = lim„_>ooe"'- 

3=1 
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Algorithm 5 IRLS-A 

1: Set := (1,.. ., 1), e® := 1, o S (0,1], 4> S (0, j^)- 

2: while e" > 0 do 

3: a;"+^ := argmin e") 

X 

4: e"+l := min {s’", | + o"+^} 

5: := argmin 

it;>0 

6: end while 


We approach the first step of the algorithm by computing a critical point of w, e) via the first 

order optimality condition 

^ -y) = 0, (51) 

or equivalently 

^(1>*<P + diag x = <l>*y. (52) 


We denote the solution of this system by The new weight is obtained in step 3 and can be 

expressed componentwise by 


^^"+l = ((x)‘+l)2 + (e"+l)2)-^. 


(53) 


Similarly to the previous section we propose the combination of Algorith m [5| with the CG method. CG 
is used to calculate an approximation of the solution of the linear system (52) in line 3 of the algorithm. 
After including the CG method, the modified algorithm which we shall consider is Algorithm CG-IRLS-A. 


Algorithm 6 CG-IRLS-A 

1: Set 1), i „ g l]_ ^ g (Q, ^), 

2: while > 0 do 

3: Compute by means of CG, s.t. ^ tol^i+i, 

where := argmin £^). Use as the initial vector for CG. 

X 

4: e"+l := min {s’", | -t a"+^} 

5: := argmin 

it;>0 

6: end while 


Notice that x always denotes the approximate solution of the minimization with respect to x in line 
3 and x the corresponding exact solution. Thus fulfills (52) but not x”+^. 

Theorem [2 provides a stopping condition for the CG method, but as in the previous section it is not 
practical for us, since we do not dispose of the minimizer and the computation of the condition number 
is computationally expensive. Therefore, we provide an alternative stopping criterion to make sure that 
||jn-i-i _ ^ tol„+i is fulfilled in line 3 of Algorithm CG-IRLS-A. 

Let x”+^’* be the Lth iterate of the CG method and define 


A„ := + diag [A™!;*]^^^^ . 

Notice that the matrix is positive semi-definite and XtD~^ = Ardiag is positive definite. 

Therefore, A„ is positive definite and invertible, and furthermore 


Amin(^n) ^ Aniin(diag [AtIC 




(54) 


We obtain 


;;n+l 


-n-i-i,; I 












Dr, 


1 - 1 | 


r.n+1,1 1 


U2 


(55) 
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where := ^*y — is the residual as it appears in line 5 of Algorithm[^ The first factor 

on the right-hand side of (55) can be estimated by 


D~ 


= An 


(d-J) 


= /max It;:- = 


_ Z — T 

max -b (e")^^ ^ ^ (e")' 


The second factor of (551 is estimated by 


l^n^ll = (Amin(-4„)) ^ (^Ai„in(diag = I At f ^max|i"|^ -b (e" 


2 -xv -1 


where we used (54) in the inequality. Thus, we obtain 


max \ Xi 


|;j.n+l _ Ara+lbll <- 

r We^iw^) ^ 


V -1 




n+1,/1 


(e”) 2 At 


and the suitable stopping condition 


^n+l,Z|| ^ 


(e")~ At 


2-t 


^max|a;"|^ -b (e”)^ 


(56) 


In the remainder of this section, we clarify how to choose the tolerance tol„+i, and establish a 


convergence result of the algorithm. In the case of t = I, the problem (49) is the minimization of the 


well-known LASSO functional. It is convex, and the optimality conditions can be stated in terms of 
subdifferential inclusions. We are able to show that at least a subsequence of the algorithm is converging 
to a solution of (49). If 0 < t < 1, the problem is non-convex and non-smooth. Necessary first order 


optimality conditions for a global minimizer of this functional were derived in [U Proposition 3.14], 
and [m Theorem 2.2]. In our case, we are able to show that the non-zero components of the limits of 
the algorithm fulfill the respective conditions. However, as soon as the algorithm is producing zeros in 
some components of the limit, so far, we were not able to verify the conditions mentioned above. On 
this account, we pursue a different strategy, which originates from [45]. We do not directly show that the 
algorithm computes a solution of problem (49). Instead we show that a subsequence of the algorithm is at 


least computing a point x\ whose transformation x^ = is a critical point of the new functional 


F.. \ ix?\ := 


(57) 


where 


Nc,-. ^ , Wcix)), ■■= sign(a::j)|a:j 


j = l,...,iV, 


(58) 


is a continuous bijective mapping and 1 < u ^ 2. It was shown in that assuming is a 

global minimizer of F^^x{x) implies that x^ is a global minimizer of TV,A) be., a solution of problem (49). 


Furthermore, it was also shown that this result can be partially extended to local minimizers. We comment 
on this issue in Remark]^ These considerations allow us to state the main convergence result. 


Theorem 5 Let 0<t^1, A>0, £TnxN ^ y g £m^ Define the sequences (i"')„gNj (£")neN o-nd 
(w^)nGN ols the ones generated by Algorithm CG-IRLS-X. Choose the accuracy tol„ of the CG-method, 
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such that 


toL ^ min < a„ 


+ 2\/ Y 





2-t \ - : 


( 59 ) 


with C, 1 ,7^-1 := 


'max(i" + (e” 
j ■' 


( e ")2 


1 -; 


(60) 


where (a„)„gN *s a positive sequence satisfying < oo and J := JT-,\{x^,w^,e^). 

Then the sequence {x'^)neti has at least one convergent subsequence (i"'=)„j,gN- Iri the case that t = 1 
and x^ ^ 0, any convergent subsequence is such that its limit x^ is a minimizer of Fi x{x). In the 
case that 0 < r < 1, the subsequence (i"'=)„j,gN can be chosen such that the transformation of its limit 
— A/"”! 1 < u ^ 2, as defined in (58), is a critical point of (57). If x^ is a global minimizer 


■■= K/M' 


of (57), then x^ is also a global minimizer of Fr^\{x). 


Remark 5 Note that the bound (59) on tol„ is—in contrast to the one in Theorem [S] —not implicit. 


Although tol„ depends on e", the latter only depends on x 


n—1 ^n—1 


W 


.n—1 


and x" e" ^ 


w 


n-2 


Since 


in particular e" does not depend on x", we are able to exchange the steps 3 and 4 in Algorithm]^ 

As we argued in Remark a possible relaxation of the tolerance bound (16) is allowed to further 
boost the convergence, the same applies to the bound (59). 


Remark 6 In the case 0 < r < 1, the theorem includes the possibility that there may exist several 
converging subsequences with different limits. Potentially only one of these limits may have the nice 
property that its transformation is a critical point. In the proof of the theorem, which follows further 
below, an appropriate subsequence is constructed. Actually this construction leads to the following hint, 
how to practically choose the subsequence: Take a converging subsequence x„, for which the u/ satisfy 
equation (85). 

It will be important below that a minimizer x^ of Ai ,\(x) is characterized by the conditions 


- 'Px^))j = Asign(x*) if x* ^ 0, 
!(<?*(?/ - I’x'^))j\ < A if x'j = 0. 


(61) 

(62) 


Note that in the (less important) case x^ = 0, our theorem does not give a conclusion about x^ being a 
minimizer of Fi^x{x). 

Remark 7 The result of Theorem for 0 < t < 1 can be partially extended towards local minimizers. 
For the sake of completeness we sketch the argument from [38]. Assume that x^ is a local minimizer. 
Then there is a neighborhood U^{x^) with e > 0 such that for all x' € U^{x^): 

Fv,\{x) ^ F^^x{x^)- 


By continuity of Af^j/r there exists an e > 0 such that the neighborhood Ue{x^) C Nv/riUeix^))- Thus, 
for all X G Ui{x^), we have x' = Af~^\.{x) G Ue{x^), and obtain 


Fr,xix) = Ml + ^\\<Px-y\\l = ||A7„/.(x')||I, + ^\\<PM/r{x') - y\\l 
= ll^'lll + ^II^A7„/.(x') - y\\l = FMxl 
> FMx^) = llx^ll,-^ + ^||<?>A7 ./,(x") - y\\l 

= M\\l + ^\M^-y\\l=PrAx^)- 

For the proof of Theorem we proceed similarly to Section by first presenting a sequence of 
auxiliary lemmas on properties of the functional J^- x and the dynamics of Algorithm CG-IRLS-A. 
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4.1 Properties of the functional Jt,\ 


Lemma 11 For the functional J^-^x defined in (50), and the iterates i”, re”, and e" produced by Algo¬ 
rithm CG-IRLS-X, the following inequalities hold true: 




(63) 

(64) 

(65) 


Proof The first inequality holds because is the minimizer and the second inequality holds since 

e"+i ^ e". In the third inequality we use the fact that the CG-method is a descent method, decreasing 
the functional in each iteration. Since we take i" as the initial estimate in the first iteration of CG, the 
output of CG must have a value of the functional that is less or equal to the one of the initial 
estimate. 


The iterative application of Lemma 11 leads to the fact that for each n G N+ the functional Jt,x is 
bounded: 

0 < < Jr,A(i\ w°,e°) = 3 . (66) 


Since the functional is composed of positive summands, its definition and (66) imply 

^ V^, 


1 ^ 


N 


\ i=i 


G 

■'3) 




2 J 

T 

2 -r 

tJ 


(67) 


j = l,...,iV. 


The last inequality leads to a general relationship between the t' 2 -norm and £ 2 (w")-norm for arbitrary 
a; e 

I 2-t 

' 2 -t' 




T J 


mi2 


( 68 ) 


In order to show convergence to a critical point or minimizer of the functional PV.A, we will use 
the first order condition (51). Since this property is only valid for the exact solution we need a 

connection between and a;”+^. Observe that 

J,,A(i”+\a;",e") ^ J,.A(i"+\ ic", e") 


(69) 


since a;"^^ is the exact minimizer. From (69) we obtain 

N N 


i=i 




which leads to 




1 


2A 


(70) 


Since (69) holds in addition to (65) and (66), we conclude, also for the exact solution the bound 

(71) 


11'^^” - y\\i2 < ^2AJ^,A(i:”,w’^-i,e’^-i) 5$ 


for all n S N, and 


^n+l I 


\l.2{w'" 




^ j2J 

^ \ / • 


(72) 
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Additionally using (l7T|), we are able to estimate the second summand of (l70|) by 


= + 2 

(||<?i-+i - <Px^+^\U, + 2\\<Px^+^ - y\\i,) 

^ll^jn+i _ (||<?i”+i - y\U, + 3||<?x"+i - y\U,) ^ 4\/2AJ||<?|| P"+i - x^+%,, 


( 73 ) 


where we used the Cauchy-Schwarz inequality in the second inequality, the triangle inequality in the 
third inequality, and the bounds in (67) and 0 in the last inequality. 

The following pivotal result of this section allows us to control the difference between the exact and 
approximate solution of the linear system in line 3 of Algorithm CG-IRLS-A. 


Lemma 12 For a given positive number a„+i and a choice of the accuracy toln+i satisfying (59), the 
functional Jr^\ fulfills the two monotonicity properties 




,w 


,£”+')-Jr,^ a 


■n+1 


and 


- P,A(x”+\u;”,e") ^ a„+i. 

Proof By means of the relation 


(74) 

(75) 


y,"+l = _ <W". 

3 3 yjTi ^3 \ (;f.n+1^2 _|_ (pn+l\2 


{iff + (e")2 

(i"+l)2 + (e»+i)2 


'max(a;”)^ -|- {e^f 


^ w 


n I J 


(e"+i )2 


= wfC^., 


where Cu,»» was defined in (60), we can estimate 


Jr,xix ) - Jr.A(i"+\w”+\e”+") 


N 

«5 E (*”■ - ^r‘) (*r ‘+^r') 

i=i 




< - 

2 




4a/^, 


2A 


T 

^ 2 ' 


\ 3 = 1 


N N . /n-K j 


\ 3 = 1 


2 




4a/^ 


2A 


2A 






n+1 1 




(lo")) „Ip Ip. 


v/^C'„,rp"+l -i"+l 


4a/TAJ //2-r 


>) 2A 


|p||p”+i-x’^+i| 




2A 


2-r 

tJ 


tJ 

IPII I ^ «" + l. 


where we used (73) in the second inequality, Cauchy-Schwarz in the third inequality, and (^ 68 L (67), 


and (72) in the sixth inequality. Thus we obtain (74). To show (75), we use ( 68 ) in the second to last 
inequality, condition (59) in the last inequality and the fact that 5:"+^ = argmin J,->,(a;, u>”, £") (and 
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thus fulfilling ©) in the second identity below: 

\T 

1 


( 76 ) 


= i E ((*r')' - (*r')') <+^ +2 - y),j (77) 

i=i 

N 

= i E ((^r')' - - 25r'^r '+2 m 


N 




i=i 

< /r W 

'' I 2 2A V T-J 


lyn+l _ An+l||2 ^ 

^ I h2(™") ^ ®n+l- 


(79) 

(80) 


Besides Lemma [T^ there are two more helpful properties of the functional. First, the identity 




i2 

can be shown by the same calculation as in ( [76| , by means of replacing by i”. Second, it follows 
in particular that 




2-t 

tJ 


-x^Wl < -F 


n+1 ;^71||2 


2 II- - iir2(™") 

^ P,Pi" ,«;",£”)- , xc", e"). 


(81) 


where the estimate (681 is used in the first inequality. 


4.2 Proof of convergence 

We need to show that the difference i"+^ — i" between two successive exact iterates and the one between 
the exact and approximated iterates, i" — i", become arbitrarily small. This result is used in the proof 
of Theorem]^ to show that both (i")„gN and (i")„gN converge to the same limit. 

Lemma 13 Consider a summable sequence (an)„gN o.nd choose the accuracy of the CG solution tol„ 
satisfying (59) for the n-th iteration step. Then the sequences (a;")„gN (i")nGN have the properties 

lim p” - = 0 (82) 

n—>-oo 

and 


lim p"+i =0. 

n—>-oo 


(83) 


Proof We use the properties of J, which we derived in the previous subsection. First, we show (82): 

' M 


2-r 

rJ 


M 


E 




^ ^ Jr,x{x^, xc", e") - P,A(i"+\ xc", £") 


n—1 

M 


^ J2 PA(i", w^, £") - P,A(i"+l, xc", £") + a„+i 


n—1 

M 


^ Y, PA(i", «;",£") - P,A(i"+\xx;"+\£"+i) + a„+i 


n—1 

M 


^ ^ PA(i",«^",£") - PA(i”+\«^”+\ £"+') +2a 


n+1 


n—1 
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M 

+ 2 Cln +1 


M 

^ J -\-2 Cin+l- 

n=l 


We used (81) in the first inequality, (751 in the second inequality, (63) and (641 in the third inequality, (74) 
in the fourth inequality and a telescoping sum in the identity. Letting M —)■ oo we obtain 


2-t 

tJ 


El 

n—l 


- X 


,n ||2 


< J + 2 ^ a„+i < oo 


n—l 


and thus (82). 


Second, we show (83). From line 1 and 3 of (76) we know that 


,W 




= dE 


W-,£^) - Jr,\(,X- 
N 

' x-^+^f - {x-^+^f - 2i" 


i=i 




+ 2 w- + - <!>x-+^\\l 


e2(w’-) + ^11^' 


;n+l 


— 


n+l 112 
1^2 • 


Since the second summand is positive, we conclude 






Together with (75) we find that 


2-t 

T J 


i"+l - T 


n+l ||2 


< ||r”+^ — 

^ < «„+i, 


and thus taking limits on both sides we get 


2-r 

tJ 


lim sup ^ lim a„+i = 0, 

n^oo n^oo 


which implies (83). 


Remark 8 By means of Lemma 13 we obtain 
lim IIt” lim 


|i" —i:"||f2+ 1™ \\x^ — X 
n—fCJO ' n—foo 


n+l I 


+ lim —i 


sn+l I 


U 2 


= 0 . 


(84) 


The following lemma provides a lower bound for the e", which is used to show a contradiction in the 
proof of Theorem 5 Recall that (p G (O, jjr,:) is the parameter appearing in the update rule for e in step 
l^of both the algorithms CG-IRLS-A and IRLS-A. 

Lemma 14 ( |43l Lemma 4.5.4, Lemma 4.5.6]) Let r = 1 and thus wj = j € 

,N}. There exists a strictly increasing subsequence and some constant C > 0 such that 


Proof Since Jr,\{x'^, w'^, e") is decreasing with n due to Lemma 11 and bounded below by 0, the difference 
\Jr^\{x'^~^ ,£^~^) — Jr,\{x^, w”, e")| is Converging to 0 for n —>• 00 . In addition —)■ 0 forn —> 00 , 

and thus by definition also e” —>■ 0. Consequently there exists a subsequence such that 




(85) 


Following exactly the steps of the proof of [131 Lemma 4.5.6.] yields the assertion. Observe that all of 
these steps are also valid for 0 < r < 1, although in [HI Lemma 4.5.6] the author restricted it to the 
case T ^ 1. 
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Remark 9 The observation in the previous proof that (e”) converges to 0 will be again important below. 


We are now prepared for the proof of Theorem ([^. 


Proof (Proof of Theorem^ Consider the subsequence (a;"')igN of Lemma 14 Since is bounded 

by (67), there exists a converging subsequence which has limit 

Consider the case r = 1 and ^ 0. We first show that 


— oo < lim = lim < oo, for all j = 1,..., N. 


( 86 ) 


"fc'i 

j Mfc 


It follows from equation (51) and the boundedness of the residual 0 that the sequence 
is bounded, i.e., 

- y)|| ^ a 

Therefore, there exists a converging subsequence, for simplicity again denoted by To show 

the identity in (86), we estimate 




tol 




•‘rifc + l 


Orifc + l^ 


rifc + l 




nfc + 1 


_ — ^ _ —/tfc-r _ 

V^^max(i”'')2 + {e"<=)2.^(i-'-)2 + (en.)2 " ^^(max |if !)(£"<=) 


< 


flrife+l 


V^( 


max a;« 


for all j = 1,..., IV, where the second inequality follows by the upper bound of tol„ in (59), and the last 


inequality is due to the definition of £"+^ which yields ^ 1. Since we assumed lim i"'' = cc'*' ^ 0, 

^ fc—>-oo 

there is a such that for all k ^ fcoj we have that max |if | ^ c > 0. Since (a^j.) tends to 0, we conclude 

3 ■> 

that lim — if | = 0, and therefore we have (IS^. Note that we will use the notation 

ko several times in the presentation of this proof, but for different arguments. We do not mention it 
explicitly, but we assume a newly defined fcp to be always larger or equal to the previously defined one. 

Next we show that xf is a minimizer of Fi x by verifying conditions (61) and (62). To this end we 
notice that by Lemma 13 and Remark 8 it follows that lim if = lim if = lim if = x^. By 

_ I I Ai—>-oo ^ k—^oo ^ fc—>-o o 

means of this result, in the case of x^ 0, we have, due to continuity arguments, (511 and Remark 


— {<!>*{y — <l>x^))j = lim —{<P*{y — <RxP‘‘))j= lim Ai 

k—>-oo k—¥c 


■nk ■nk-l 
3 3 


= A lim if ((if-^)2 + (£”'=-i)2)-2 


k—>-oo 


= Aa;f(a;f^ + (0)f-5 =Asign(a;f, 


and thus (|61 ). 

In or 
4.5.9. in 


. Assume 


In order to show condition (62) for j such that Xj = 0, we follow the main idea in the proof of Lemma 


lim if wf ^ > 1. 


k—^oo 


(87) 


Then there exists an e > 0 and a fco G N, such that for all k ^ ko the inequality (if wf )^ > 1 + e 
holds. Due to (86), we can furthermore choose ko large enough such that also (if wf ~^)^ > 1 + e for 


all k ko. Recalling the identity for wf from Lemma 14 we obtain 


(if)^>(l + e)((u;f-i)-i)2 

= (1 + e)((if-1)2 + (£"'■-1)2) ^ (I + e)(£"'=+i)2 


( 88 ) 


^ (1 + e)C|(u;f )-i|2^|(n;f-i)-i - «'=)-Y'^ ^ (1 + e)C|if l^-^Krcf-i)-i - (ref )-Y‘^, 
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where the second inequality follows by the definition of the e", and the third inequality follows from 
Lemma 


14 


Furthermore, in the last inequality we used that re" ^ |i?| ^ which follows directly from the 


definition of . By means of this estimate, we conclude 






2-20 
40 ^ 


(89) 


Since 0 < 0 < the exponent > 1. In combination with the fact that i”*" is vanishing for fc — >■ (X), 

_ 2 — 20 _ 1 

we are able to choose fcp large enough to have ((1 + e)C)~^ \ < e := 1 — (1 + e )“2 for all 

k ko and therefore 

^ unfc in _ (90) 




The combination of (88) and (90) yields 




> (1 + e ) ' >(1 + | 2(1 - e ) 2 . 


(91) 


Since we have ^ | > 1 + e for all fc 0 fco, we also have x^’‘ ^ 0, and thus, we can divide in (|9 

by I a;"*'I and insert the definition of e to obtain 


l>(l + e)(l-er = 1, 


which is a contradiction, and thus the assumption 
continuity argument, we show condition (|62[) by 


is false. By means of this result and again a 


{'P^{y — *^x^))j = lim (^^(y — '?x"'“))j = A lim x'^^w^'° ^ ^ A. 

/c—>-oo fc—>-oo ' 


At this point, we have shown that at least the convergent subsequence (i”'')„^gN is such that its limit 
x^ is a minimizer of Fi^\{x). To show that this is valid for any convergent subsequence of (a;")„gN, we 
remind that the subsequence (i"''“)„j,gN is the one of Lemma 14 and therefore fulfills (851. Thus, we can 
adapt [ISl Lemma 4.6.1] to our case, following the arguments in the proof. These arguments only require 
the monotonicity of the functional Jt,\, which we show in Lemma 11 Consequently the limit x^ of any 
convergent subsequence of (i")„gN is a minimizer of F'i_A(a^)- 


Consider the case 0 < t < 1. The transformation JVi^{x) defined in (58) is continuous and bijective. 
Thus, x^ := N~^^{x^) is well-defined, and xj' = 0 if and only if = 0. At a critical point of the 

differentiable functional Ft,\, its first derivative has to vanish which is equivalent to the conditions 


-\xj\ " -^*'^-^v/t{x))- +Xvsign{xj)\xj 


V—1 


= 0, j = l,...,N. 


(92) 


We show now that x^ fulfills this first order optimality condition. It is obvious that for all j such that 
x^ = 0 the condition is trivially fulfilled. Thus, it remains to consider all j where x^ ^ 0. As in the 


case of T = I, we conclude by Lemma 


13 


Therefore continuity arguments as well as (1^ yield 


and Remark^ that lim 5"'“ = lim i"'" = lim i"'‘ ^ = x). 


k—¥oo 


fc—>-00 ^ 


k—^oo 


— {<P*{y — ‘l>x^))j= lim — (^*(?/— <?i"'=))j = lim Xtx^'^w 

k—¥oo k—^oo 


nfc-1 


3 

2-t 


= At lim if ((ff-^)2 + (e”''-i)")-' 


A;—>-oo 


= Xrxjdxj) + {0)) 2 = Arsign(a;f 


I ‘T— 1 


We replace x^ = J\f^/r{x^) and obtain 


-{^*{y-<PAf^/r{x^))j = ATsign((04/^(i^))^-)|(A/0/r(i^))j| 


= Ar sign(if |i 


Am— — 


We multiply this identity by f x^j ^ and obtain (92). 

If x^ is also a global minimizer of F^j x, then x^ is a global minimizer of F^ x. This is due the 
equivalence of the two problems which was shown in |38[ Proposition 2.4] based on the continuity and 
bijectivity of the mapping [13 Proposition 3.4]. 
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5 Numerical Results 


We illustrate the theoretical results of this paper by several numerical experiments. We first show that 
our modihed versions of IRLS yield signihcant improvements in terms of computational time and often 
outperform the state of the art methods Iterative Hard Thresholding (IHT) [3] and Fast Iterative Soft- 
Thresholding Algorithm (FISTA) [T]. 

Before going into the detailed presentation of the numerical tests, we raise two plain numerical 
disclaimers concerning the numerical stability of CG-IRLS and CG-IRLS-A: 


— The hrst issue concerns IRLS methods in general: The case where e" —>■ 0, i.e., Xj —>■ 0, for some 
j G {1,..., N} and n —>■ oo, is very likely since our goal is the computation of sparse vectors. In this 
case w" will for some n become too large to be properly represented by a computer. Thus, in practice. 


we have to provide a lower bound for e by some e™™ > 0. Imposing such a limit has the theoretical 
disadvantage that in general the algorithms are only calculating an approximation of the respective 
problems Q and (49). Therefore, to obtain a “sufficiently good” approximation, one has to choose 
ginin sufficiently small. This raises yet another numerical issue: If we choose, e.g., = Ie-8 and 

assume that also x" ^ 1, then wj is of the order 1e-|-8. Compared to the entries of the matrix <P, 
which are of the order 1, any multiplication or addition by such a value will cause serious numerical 
errors. In this context we cannot expect that the IRLS method reaches high accuracy, and saturation 
effects of the error are likely to occur before machine precision. 

— The second issue concerns the CG method: In Algorithm and Algorithm we have to divide at 
some point by or {Ap\p^)i^ respectively. As soon as the residual decreases, alsop® decreases 

with the same order ol^magnitude. If the above vector products are at the level of machine precision, 
e.g. 1e-16, this would mean that the norm of the residual is of the order of its square-root, here 
1e-8. But this is the measure of the stopping criterion. Thus, if we ask for a high precision of the 
GG method, the algorithm might become numerically unstable, depending on the machine precision. 
Such saturation of the error is an intrinsic property of the GG method, and here we want to mention 
it just as a disclaimer. As described further below, we set the lower bound of the GG tolerance to the 
value 1e-12, i.e., as soon as this accuracy is reached, we consider the result as “numerically exact”. 
For this particular bound the method works stably on the machine that we used. 


In the following, we start with a description of the general test settings, which will be common for 
both Algorithms GG-IRLS and GG-IRLS-A. Afterwards we independently analyze the speed of both 
methods and compare them with state of the art algorithms, namely IHT and FISTA. We respectively 
start with a single trial, followed by a speed-test on a variety of problems. We will also compare the 
performance of both CG-IRLS and GG-IRLS-A for the noiseless case which leads to surprising results. 


5.1 Test settings 

All tests are performed with MATLAB version R2014a. For the sake of faster tests (in some cases exper¬ 
iments run for several days) and simplicity, we restrict ourselves to experiments with models defined by 
real numbers although everything can be similarly done over the complex field. To exploit the advantage 
of fast matrix-vector multiplications and to allow high dimensional tests, we use randomly sampled par¬ 
tial discrete cosine transformation matrices <l>. We perform tests in three different dimensional settings 
(later we will extend them to higher dimension) and choose different values N of the dimension of the 
signal, the amount m of measurements, the respective sparsity k of the synthesized solutions, and the 
index K in Algorithm (CG-)IRLS: 



Setting A 

Setting B 

Setting C 

N 

2000 

4000 

8000 

m 

800 

1600 

3200 

k 

30 

60 

120 

K 

50 

100 

200 


For each of these settings, we draw at random a set of 100 synthetic problems on which a speed- 
test is performed. For each synthetic problem the support A is determined by the first k entries of a 
random permutation of the numbers 1,..., A^. Then we draw the sparse vector x* at random with entries 
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X* ^ A/'(0,1) ioT i G A and x\o = 0, and a randomly row sampled normalized discrete cosine matrix <1>, 
where the full non-normalized discrete cosine matrix is given by 


* j 


1, * = l,j = 


For a given noise vector e of entries ^ A/'(0, cr^), we eventually obtain the measurements y = <Px* + e. 
Later we need to specify the noise level and we will do so by fixing a signal to noise ratio. By assuming 
that ^ has the Restricted Isometry Property of order k (compare, e.g., [20]), i-e., ~ lklU 2 , 

z G with ^supp( 2 ;) ^ k, we can estimate the measurement signal to noise ratio by 


MSNR 


E(l|e|UJ 


Vk 


y/rna 


In practice, we set the MSNR first and choose the noise level a = K MSNR = oo, the problem 

is noiseless, i.e., e = 0. 


5.2 Algorithm CG-IRLS 


Specific settings. We restrict the maximal number of outer iterations to 30. Furthermore, we modify (161, 
so that the CG-algorithm also stops as soon as ^ 1e-12. As soon as the residual undergoes 

this particular threshold, we call the GG solution (numerically) “exact”. The e-update rule is extended 
by imposing the lower bound e” = e" V e™™ where e™‘” = 1e-9/A^. The summable sequence (an)nGN in 
Theorem]^ is defined by a„ = 100 • (1/2)". 

As we define the synthetic tests by choosing the solution x* of the linear system <Px* = y (here we 
assume e = 0), we can use it to determine the error of the iterations ||i" — x*\\t^. 


IRLS vs. CG-IRLS To get an immediate impression about the general behavior of GG-IRLS, we compare 
its performance in terms of accuracy and speed to IRLS, where the intermediate linear systems are solved 
exactly via Gaussian elimination (i.e., by the standard MATLAB backslash operator). We choose IHT as 
a first order state of the art benchmark, to get a fair comparison with another method which can exploit 
fast matrix-vector multiplications. 

In this first single trial experiment, we choose an instance of setting B, and set r = 1 for GG-lRLS 
and compare it to IRLS with different values of r. The result is presented in the left plot of Figure[2 We 
show the decrease of the relative error in .^ 2 -iiorm as a function of the computational time. One sees that 
the computational time of IRLS is significantly outperformed by GG-IRLS and by the exploitation of fast 
matrix-vector multiplications. The standard IRLS is not competitive in terms of computational time, 
even if we choose t < 1, which is known to yield super-linear convergence |16j . With increasing dimension 
of the problem, in general the advantage of using the GG method becomes even more significant. However 
GG-IRLS does not outperform yet IHT in terms of computational time. We also observe the expected 
numerical error saturation (as mentioned at the beginning of this section), which appears as soon as the 
accuracy falls below 1e-13. 

For this test, we set the parameter /3 in the e-update rule to 2. We comment on the choice of this 
particular parameter in a dedicated paragraph below. 


Modifications to CG-IRLS As we have shown by a single trial in the previous paragraph, GG-IRLS as it 
is presented in Section [3.2| is not able to outperform IHT. Therefore, we introduce the following practical 
modifications to the algorithm: 

(i) We introduce the parameter maxiter_cg, which defines the maximal number of inner GG iterations. 
Thus, the inner loop of the algorithm stops as soon as maxiter_cg iterations were performed, even if 
the theoretical tolerance tol„ is not reached yet. 
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decrease of relative error 
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Fig. 1 Single trial of Setting B. Left: Relative error plotted against the computational time for IRLS[r = 1] (light green, 
o), IRLS[r = 0.9] (green, □), IRLS[r = 0.8] (dark green, 0), CG-IRLS (blue, x), and IHT (red, —). Right: Relative error 
plotted against computational time for CG-IRLS (blue, x), CG-IRLSm (dark blue, -h), IHT-hCG-IRLSm (black, *), and 
IHT (red, -). 


(ii) CG-IRLS includes a stopping crit erion depending on tol„+i, which is only implicitly given as a 

function of (compare Section 3.3.1 and in particular formulas (16) and ( |17[ )), which in turn 
depends on the current a;"^^ by means of sorting and a matrix-vector multiplication. To further 
reduce the computational cost of each iteration, we avoid the aforementioned operations by only 
updating tol„+i outside the MCG loop, i.e., after the computation of with fixed tol„+i we 

update as in step 3 of Algorithm CG-IRLS and subsequently update tol „+2 which again is fixed 
for the computation of 

(iii) The left plot of Figure [^reveals that in the beginning CG-IRLS reduces the error more slowly than 
IHT, and it gets faster after it reached a certain ball around the solution. Therefore, we use IHT as a 
warm up for CG-IRLS, in the sense that we apply a number start_iht of IHT iterations to compute 
a proper starting vector for CG-IRLS. 


We call CG-IRLSm the algorithm with modifications (i) and (ii), and IHT-hCG-IRLSm the algorithm 
with modifications (i), (ii), and (iii). We set maxiter_cg = [to/ 12J , start_iht = 150, and we set /3 to 0.5. 
If these algorithms are executed on the same trial as in the previous paragraph, we obtain the result which 
is shown on the right plot in Figure For this trial, the modified algorithms show a significantly reduced 
computational time with respect to the unmodified version and they now converge faster than IHT. How¬ 
ever, the introduction of the practical modifications (i)-(iii) does not necessarily comply anymore with 
the assumptions of Theorem Therefore, we do not have rigorous convergence and recovery guarantees 
anymore and recovery might potentially fail more often. In the next paragraph, we empirically inves¬ 
tigate the failure rate and explore the performance of the different methods on a sufficiently large test set. 


In order to investigate the influence of the tolerance toln and the number of (inner) iterations of 
the MGG procedure performed along the IRLS iterations, we plot both quantities in Figure for GG- 
IRLS, GG-IRLSm, and IHT-|-GG-IRLSm. Obviously the tolerance quickly decreases in any method. For 
IHT-|-CG-IRLSm also the number of MCG iterations decreases (until the method becomes unstable), 
while for the other two methods the number of MCG iterations first increases and then decreases again 
until the methods also become unstable. In CG-IRLSm, the number of MGG iterations is bounded by 
maxiter_cg = [m/I2j. This more economical behavior only slightly influences the approximation of the 
MGG solutions and leads to reduced computational time. 

Another natural modification to GG-IRLS consists in the introduction of a preconditioner to compen¬ 
sate for the deterioriation of the condition number of L>DnL>* as soon as e" becomes too small (when w” 
becomes very large). The matrix <I><L* is very well conditioned, while the matrix <I>Dn‘L* “sandwiching” 
Dn becomes more ill-conditioned as n gets larger, and, unfortunately, it is hard to identify additional 
“sandwiching” preconditioners such that the matrix Pn^IDn^P*P* is suitably well-conditioned. In the 
numerical experiments standard preconditioners failed to yield any significant improvement in terms of 
convergence speed. Hence, we refrained from introducing further preconditioners. Instead, as we will 
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Tolerance 



Fig. 2 Single trial of Setting B. Left: Number of MCG iterations (divided by m) plotted against the outer iteration number 
for CG-IRLS (blue, x), CG-IRLSm (dark blue, +), IHT+CG-IRLSm (black, *). Right: Tolerance toln plotted against the 
outer iteration number for the same algorithms. 


show at the end of Subsection 5.3 a standard (Jacobi) preconditioning of the matrix 


(<?*<!>+ diag [a™"] , 


where the source of singularity is added to the product leads to a dramatic improvement of com¬ 
putational speed. 


Empirical test on computational time and failure rate In the following, we define a method to be “suc¬ 
cessful” if it is computing a solution x for which the relative error ||a: — a;*||£ 2 /||a :*||^2 < 1e- 13. The 
computational time of a method is measured by the time it needs to produce the first iterate which 
reaches this accuracy. In the following, we present the results of a test which runs the methods CG- 
IRLS, CG-IRLSm, IHT-|-CG-IRLSm, and IHT on 100 trials of Setting A, B, and C respectively and 
T G {1,0.9,0.8}. For values of r < 0.8 the methods become unstable, due to the severe nonconvexity of 
the problem and it seems that good performance cannot be reached below this level. Therefore we do 
not investigate further these cases. Let us stress that IHT does not depend on r. 

In each setting we check for each trial which methods succeeds or fails. If all methods succeed, we 
compare the computational time, determine the fastest method, and count the computational time of 
each method for the respective mean computational time. The results are shown in Figurej^ By analyzing 
the diagrams, we are able to distill the following observations: 

— Especially in Setting A and B, CG-IRLSm and IHT-|-CG-IRLSm are better or comparable to IHT in 
terms of mean computational time and provide in most cases the fastest method. GG-IRLS performs 
much worse. The failure rate of all the methods is negligible here. 

— The gap in the computational time between all methods becomes larger when N is larger. 

— With increasing dimension of the problem, the advantage of using the modified CG-IRLS methods 
subsides, in particular in Setting C. 

— In the literature [iniiniiiaiiii superlinear convergence is reported for r < 1, and perhaps one of the 
most surprising outcomes is that the best results for all CG-IRLS methods are instead obtained for 
T = 1. This can probably be explained by observing that superlinear convergence kicks in only in a 
rather small ball around the solution and hence does not necessarily improve the actual computation 
time! 

— Not only the computational performance, but also the failure rate of the CG-IRLS based methods 
increases with decreasing r. However, as expected, GG-IRLS succeeds in the convex case of r = 1. 
The failure of CG-IRLS for r < 1 can probably be attributed to non-convexity. 

We conclude that CG-IRLSm and IHT-|-GG-IRLSm perform well for r = 1 and for the problem 
dimension N within the range of 1000 - 10000. They are even able to outperform IHT. However, by 
extrapolation of the numerical results IHT is expected to be faster for N > 10000. (This is in compliance 
with the general folklore that first order methods should be preferred for higher dimension. However, as 
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Fig. 3 Empirical test on Setting A, B, and C for the methods CG-IRLS (blue), CG-IRLSm (white), IHT+CG-IRLSm 
(black), and IHT (red). Upper: Mean computational time. Genter: Fastest method (in %). Lower: Failure rate (in %). 


we will see in Subsection 5.3 a proper preconditioning of CG-IRLS-A will win over IHT for dimensions 
N ^ 10®!) As soon as N < 1000, direct methods such as Gaussian elimination are faster than GG, and 
thus, one should use standard IRLS with r < 1. 


Choice of f3, maxiter-cg, and start-iht The numerical tests in the previous paragraph were pre¬ 
ceded by a careful and systematic investigation of the tuning of the parameters /3, maxiter_cg, and 
start_iht. While we fixed start.iht to 100, 150, and 200 for Setting A, B, and G respectively to 
produce a good starting value, we tried /? G {1/iV, 0.01,0.1, 0.5,0.75,1,1.5, 2, 5,10}, and maxiter_cg G 
{[to/ 8J, [m/12j, [m/16j} for each setting. The results of this parameter sensitivity study can be sum¬ 
marized as follows: 

— The best computational time is obtained for /3 ^ 1. In particular the computational time is not 
depending substantially on /3 in this order of magnitude. More precisely, for GG-IRLS the choice of 
/3 = 0.5 and for (IHT-l-)GG-IRLSm the choice of /3 = 2 works best. 

— The choice of maxiter_cg very much determines the tradeoff between failure and speed of the method. 
The value [to/ 12J seems to be the best compromise. For a smaller value the failure rate becomes too 
high, for a larger value the method is too slow. 

Phase transition diagrams. Besides the empirical analysis of the speed of convergence, we also investigate 
the robustness of GG-IRLS with respect to the achievable sparsity level for exact recovery of x*. Therefore, 
we fix A = 2000 and we compute a phase transition diagram for IHT and GG-IRLS on a regular Gartesian 
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50 X 40 grid, where one axis represents m/N and the other represents k/m. For each grid point we plot 
the empirical success recovery rate, which is numerically realized by running both algorithms on 20 
random trials. CG-IRLS or IHT is successful if it is able to compute a solution with a relative error of 
less than 1e-4 within 20 or 500 (outer) iterations respectively. Since we aim at simulating a setting in 
which the sparsity k is not known exactly, we set the parameter K = 1.1 ■ k for both IHT and CG-IRLS. 
The interpolated plot is shown in Figure It turns out that CG-IRLS has a significantly higher success 
recovery rate than IHT for less sparse solutions. 



Fig. 4 Phase transition diagrams of IHT and CG-IRLS for N = 2000. The recovery rate is presented in grayscale values 
from 0% (white) up to 100% (black). As a reference, in the right subfigure, the 90% recovery rate level line of the CG-IRLS 
phase transition diagram is plotted to show more evidently the improved success rate of the latter algorithm. 


5.3 Algorithm CG-IRLS-A 


Specific settings We restrict the maximal number of outer iterations to 25. Furthermore, we modify 
(56), so that the CG-algorithm also stops as soon as ^ Ie-16- As soon as the residual 


undergoes this particular threshold, we call the CG solution (numerically) “exact”. The e-update rule 
is extended by imposing the lower bound e" = e” V e™™ where e™™ = 1e-9. Additionally we propose 
to choose ^ 0.8"£”, which practically turns out to increase dramatically the speed of convergence. 
The summable sequence (a„)„gN in Theorem i is defined by setting = \fWrn ■ 10^ • (1/2)". We split 
our investigation into a noisy and a noiseless setting. 

For the noisy setting we set MSNR = 10. According to mi, we choose A = caVm logiV as a 
near-optimal regularization parameter, where we empirically determine c = 0.48. Since we work with 


relatively large values of A in the regularized problem (49), we cannot use the synthesized sparse solution 
X* as a reference for the convergence analysis. Instead, we need another reliable method to compute the 
minimizer of the functional. In the convex case of r = 1, this is performed by the well-known and fast 
algorithm FISTA [1], which shall also serve as a benchmark for the speed analysis. In the non-convex case 
of T < 1, there is no method which guarantees the computation of the global minimizer, thus, we have 
to omit a detailed speed-test in this case. However, we describe the behavior of Algorithm GG-IRLS-A 
for T changing. 

If the problem is noiseless, i.e., e = 0, the solution of ( [4^ converges to the solution of (Q for 
A —> 0. Thus, we choose X = m - 1e-8, and assume the synthesized sparse solution x* as a good proxy for 
the minimizer and a reference for the convergence analysis. (Actually, this can also be seen the other way 
around, i.e., we use the minimizer x^ of the regularized functional to compute a good approximation to 
a;*.) It turns out that for A « 0, as we comment below in more detail, FISTA is basically of no use. 


CG-IRLS-X vs. IRLS-X As in the previous subsection, we first show that the GG-method within IRLS-A 
leads to significant improvements in terms of the computational speed. Therefore we choose a noisy trial 
of Setting B, and compare the computational time of the methods IRLS-A, GG-IRLS-A, and FISTA. The 
result is presented on the left plot of FigureWe observe, that GG-IRLS-A computes the first iterations 
in much less time than IRLS-A, but due to bad conditioning of the inner GG problems it performs much 
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worse afterwards. Furthermore, as may be expected, the algorithm is not suitable to compute a highly 
accurate solution. For the computation of a solution with a relative error in the order of 1e-3, CG-IRLS-A 
outperforms FISTA. FISTA is able to compute highly accurate solutions, but a solution with a relative 
error of 1e-3 should be sufficient in most applications because the goal in general is not to compute the 
minimizer of the Lagrangian functional but an approximation of the sparse signal. 


decrease of relative error 



decrease of relative error 



computation time [s] 


Fig. 5 Single trial of Setting B. Left: Relative error plotted against the computational time for IRLS-A (light green, o), 
CG-IRLS-A (blue, x), and FISTA (red, —). Right: Relative error plotted against computational time for CG-IRLS-A (blue, 
x), PGG-IRLS-A (dark blue, -b), PCGm-IRLS-A (black, *), and FISTA (red, -). 


Modifications to CG-IRLS-X To further decrease the computational time of CG-IRLS-A, we propose the 
following modifications: 

(i) To overcome the bad conditioning in the CG loop, we precondition the matrix A„ = (!>*'P+diag 

by means of the Jacobi preconditioner, i.e., we pre-multiply the linear system by the inverse of its 
diagonal, (diag A„) which is a very efficient operation in practice. 

(ii) We introduce the parameter maxiter_cg which defines the maximal number of inner CG iterations 
and is set to the value maxiter_cg = 4 in the following. 

The algorithm with modification (i) is called PCG-IRLS-A, and the one with modification (i) and (ii) 
PCGm-IRLS-A. We run these algorithms on the same trial of Setting B as in the previous paragraph. 
The respective result is shown on the right plot of Figure This time, preconditioning effectively yields 
a strong decrease of computational time, especially in the final iterations where A„ is badly conditioned. 
Furthermore, modification (ii) importantly increases the performance of the proposed algorithm also in 
the initial iterations. However, again we have to take into consideration that we may violate the assump¬ 
tions of Theorem so that convergence is not guaranteed anymore and failure rates might potentially 
increase. In the two paragraphs below that are entitled Empirical test on computational time and failure 
rate with noisy/noiseless data, we present simulations on noisy and noiseless data, which give a more 
precise picture of the speed and failure rate of the previously introduced methods in comparison to 
FISTA and IHT. 

We investigate the influence of the tolerance tol„ and the number of (inner) iterations of the CG 
procedure performed along the IRLS iterations in Figure]^ for GG-IRLS-A, PGG-IRLS-A, and PCGm- 
IRLS-A. We see that the methods do not differ much in terms of the tolerance. In particular CG-IRLS-A 
and PCG-IRLS-A have nearly the same sequence of tol„, however, due to the bad conditioning, the 
number of inner CG iterations tremendously increases with growing number of outer IRLS iterations in 
GG-IRLS-A. In contrast, the number of inner CG iterations in PGG-IRLS-A stays very low. A bound on 
the GG iterations of maxiter_cg = 4 does only very slightly change the behavior of tol„ in PCGm-IRLS-A 
and leads to a further advantage in the computational time, as can be seen in Figure 
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CG iterations 




Fig. 6 Single trial of Setting B. Left: Number of CG iterations (divided by N) plotted against the outer iteration number 
for CG-IRLS-A (blue, x), PGG-IRLS-A (dark blue, +), PCGm-IRLS-A (black, *). Right: Tolerance tol„ plotted against 
the outer iteration number for the same algorithms. 


Empirical test on computational time and failure rate with noisy data In the previous paragraph, we 
observed that the CG-IRLS-A methods are only computing efficiently solutions with a low relative error. 
Thus we now focus on this setting and compare the three methods PCG-IRLS-A, PCGm-IRLS-A, and 
FISTA with respect to their computational time and failure rate in recovering solutions with a relative 
error of Ie-1, 1e-2, and 1e-3. We only consider the convex case r = 1. Similarly to the procedure in 
Section |5.2[ we run these algorithms on 100 trials for each setting with the respectively chosen values of 
A. In Figurej^the upper bar plot shows the result for the mean computational time and the lower stacked 
bar plot shows how often a method was the fastest one. We do not present a plot of the failnre rate since 
none of the methods failed at all. By means of the plots, we demonstrate that both PCG-IRLS-A, and 
PCGm-IRLS-A are faster than FISTA, while PCGm-IRLS-A always performs best. 
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Fig. 7 Empirical test on Setting A, B, and G for the methods PCG-IRLS-A (blue), PCGm-IRLS-A (black), and FISTA 
(red). Upper: Mean computational time. Lower: Fastest method (in %). 
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Empirical test on computational time and failure rate with noiseless data In the noiseless case, we 
compare the computational time of FISTA and PCGm-IRLS-A to IHT and IHT+CG-IRLSm. We set 
maxiter_cg = 40 for PGGm-IRLS-A. In a first test, we run these algorithms on one trial of Setting A, 
and G respectively, and plot the results in Figure]^ 

As already mentioned, FISTA is not suitable for small values of A on the order of m- 1e- 8 and converges 
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Fig. 8 Left: Setting A. Right: Setting C. Comparison of IHT (blue, —), FISTA (green,-), IHT+CG-IRLSm (black, *), 

and PCGm-IRLS-A (red, x). 


then extremely slowly, but PGGm-IRLS-A can compete with the remaining methods. IHT+GG-IRLSm 
is in some settings able to outperform IHT, at least when a high accuracy is needed. PGGm-IRLS-A is 
always at least as fast as IHT with increasing relative performance gain for increasing dimensions. This 
observation suggests the conjecture that PGGm-IRLS-A provides the fastest method also in rather high 
dimensional problems. To validate this hypothesis numerically, we introduce two new high dimensional 
settings (to reach higher dimensionalities and retaining low computation times for the extensive tests it 
is again very beneficial to use the real cosine transform as a model for (p): 



Setting D 

Setting E 

N 

100000 

1000000 

m 

40000 

400000 

k 

1500 

15000 

K 

2500 

25000 


We run the most promising algorithms IHT and PGGm-IRLS-A on a trial of the large scale settings D 
and E. The result, which is plotted in Figure]^ shows that PGGm-IRLS-A is able to outperform IHT in 
these settings unless one requires an extremely low relative error Ie- 8), because of the error saturation 
effect. We confirm this outcome in a test on 100 trials for Setting D and E and present the result in 
Figure 


Dependence onr. In the last experiment of this paper, we are interested in the influence of the parameter 
T. Of course, changing r also means modifying the problem resulting in a different minimizer. Due to 
non-convexity also spurious local minimizers may appear. Therefore, we do not compare the speed of the 
method to FISTA. In Figure we show the performance of Algorithm PGGm-IRLS-A for a single trial 
of Setting G and the parameters r S {1,0.9,0.8,0.7} for the noisy and noiseless setting. As reference for 
the error analysis, we choose the sparse synthetic solution +*, which is actually not the minimizer here. 

In both the noisy and noiseless setting, using a parameter t < I improves the computational time of 
the algorithm. In the noiseless case, t = 0.9 seems to be a good choice, smaller values do not improve 
the performance. In contrast, in the noisy setting the computational time decreases with decreasing r. 
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Fig. 9 Left: Setting D. Right: Setting E. Comparison of IHT (blue, —), and PCGm-IRLS-A (red, x). 
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Fig. 10 Empirical test on the mean computational time of Setting D and E for the methods IHT (blue), and PCGm-IRLS-A 
(red). 
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Fig. 11 Results of Algorithm PCGm-IRLS-A for a single trial of Setting C for different values of r with noise (right) and 
without noise (left). 


A Proof of Lemma 1101 


(in the case 0 < r ^ 1) 

Let X = or X £ and rj G arbitrary. Consider the function 

Ge,T{t) \= fe,T (x -h tr}) - fe,T (x) 


with its first derivative 


Now G£,r(0) = 0 and from the minimization property of fs^r{x), Gg^rit) > 0. Therefore, 
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“<^=”(only in the case r = 1) 

Now let X G and ^ = 0 for all r] £ J\f^. We want to show that x is the minimizer of in 

Consider the convex univariate function g{u) := For any point uq we have from convexity that 

^ [uq + [no + - «o) 

because the right-hand-side is the linear function which is tangent to ^ at uq. It follows, that for every point v £ T<^{y) we 
have 

N 

fe,l{-v) ^ fe,l{x) + '^[x^ + £^\~'-/^Xi{vi - Xi) = fe,l{x) + {x,V - = fe,l{x), 

i=l 

where we have used the orthogonality condition and the fact that {v — x) £ Since v was chosen arbitrarily, x = x^’^ 
as claimed. 


Acknowledgements Massimo Fornasier acknowledges the support of the ERC-Starting Grant HDSPCONTR “High- 
Dimensional Sparse Optimal Control” and the DFG Project “Optimal Adaptive Numerical Methods for p-Poisson Elliptic 
equations”. Steffen Peter acknowledges the support of the Project “SparsEO: Exploiting the Sparsity in Remote Sensing 
for Earth Observation” funded by Munich Aerospace. Holger Rauhut would like to thank the European Research Council 
(ERC) for support through the Starting Grant StG 258926 SPALORA (Sparse and Low Rank Recovery) and the Hausdorff 
Center for Mathematics at the University of Bonn where this project has started. 


References 

1. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging 
Sci. 2(1), 183-202 (2009). DOI 10.1137/080716542. URL http://dx.doi.org/10.1137/080716542 

2. Bickel, P., Ritov, Y., Tsybakov, A.: Simultaneous analysis of lasso and Dantzig selector. Ann. Statist. 37(4), 1705-1732 
(2009) 

3. Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 
27(3), 265-274 (2009). DOI 10.1016/j.acha.2009.04.002. URL http://dx.doi.org/10.1016/j .acha.2009.04.002 

4. Bredies, K., Lorenz, D.A.: Minimization of non-smooth, non-convex functionals by iterative thresholding. J. Optim. 
Theory Appl. 165, 78-112 (2015) 

5. Candes, E.J., J., Tao, T., Romberg, J.: Robust uncertainty principles: exact signal reconstruction from highly incom¬ 
plete frequency information. IEEE Trans. Inform. Theory 52(2), 489-509 (2006) 

6. Candes, E.J., Plan, Y.: Near-ideal model selection by ii minimization. Ann. Statist. 37(5A), 2145-2177 (2009). DOI 
10.1214/08-AOS653. URL http://dx.doi.org/10.1214/08-A0S653 

7. Candes, E.J., Tao, T.: Near optimal signal recovery from random projections: universal encoding strategies? IEEE 
Trans. Inform. Theory 52(12), 5406-5425 (2006) 

8. Chafai, D., Guedon, O., Lecue, G., Pajor, A.: Interactions between Compressed Sensing, Random Matrices and high 
Dimensional Geometry. Soc. Math. France, Paris (2012) 

9. Chambolle, A., Lions, P.L.: Image recovery via total variation minimization and related problems. Numer. Math. 
76(2), 167-188 (1997). DOI 10.1007/s002110050258. URL http://dx.doi .org/10.1007/s002110050258 

10. Chartrand, R.: Exact reconstruction of sparse signals via nonconvex minimization. Signal Processing Letters, IEEE 
14(10), 707-710 (2007). DOI 10.1109/LSP.2007.898300 

11. Chartrand, R., Staneva, V.: Restricted isometry properties and nonconvex compressive sensing. Inverse Problems 24(3), 
035,020, 14 (2008). DOI 10.1088/0266-5611/24/3/035020. URL http://dx.doi. org/10.1088/0266-5611/24/3/035020 

12. Chartrand, R., Yin, W.: Iteratively reweighted algorithms for compressive sensing. In: Acoustics, Speech and Signal 
Processing, 2008. ICASSP 2008. IEEE International Conference on, pp. 3869-3872 (2008). DOI 10.1109/ICASSP.2008. 
4518498 

13. Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic decomposition by Basis Pursuit. SIAM J. Sci. Comput. 20(1), 
33-61 (1999) 

14. Cline, A.K.: Rate of convergence of Lawson’s algorithm. Math. Comp. 26, 167-176 (1972) 

15. Cohen, A., Dahmen, W., DeVore, R.A.: Compressed sensing and best fc-term approximation. J. Amer. Math. Soc. 
22(1), 211-231 (2009) 

16. Daubechies, L, DeVore, R., Fornasier, M., Giinturk, C.: Iteratively re-weighted least squares minimization for sparse 
recovery. Comm. Pure Appl. Math. 63(1), 1-38 (2010) 

17. Dirksen, S., Lecu’e, G., Rauhut, H.: On the gap between RIP-properties and sparse recovery conditions. Preprint 
ArXiv:1504.05073 (2015) 

18. Donoho, D.L.: Compressed sensing. IEEE Trans. Inform. Theory 52(4), 1289-1306 (2006) 

19. Fornasier, M., Rauhut, H., Ward, R.: Low-rank matrix recovery via iteratively reweighted least squares minimization. 
SIAM J. Optim. 21(4), 1614-1640 (2011). DOI 10.1137/100811404. URL http://dx.doi.org/10.1137/100811404 

20. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. New York, NY: Birkhauser/Springer 
(2013). DOI 10.1007/978-0-8176-4948-7 

21. Gorodnitsky, I.F., Rao, B.D.: Sparse signal reconstruction from limited data using FOCUSS: a recursive weighted norm 
minimization algorithm. IEEE Transactions on Signal Processing 45(3), 600-616 (1997) 

22. Gribonval, R., Nielsen, M.; Sparse representations in unions of bases. IEEE Trans. Inform. Theory 49(12), 3320-3325 
(2003) 

23. Han, W., Jensen, S., Shimansky, L: The Kacanov method for some nonlinear problems. Appl. Numer. Math. 24(1), 
57-79 (1997) 


39 


24. Hestenes, M.R., Stiefel, E.: Methods of Conjugate Gradients for Solving Linear Systems. Journal of Research of the 
National Bureau of Standards 49(6), 409-436 (1952) 

25. Hollanda, P.W., Welsch, R.E.: Robust regression using iteratively reweighted least-squares. Communications in Statis¬ 
tics - Theory and Methods 6(9), 813-827 (1977) 

26. Ito, K., Kunisch, K.: A variational approach to sparsity optimization based on Lagrange multiplier theory. Inverse Prob¬ 
lems 30(1), 015,001, 23 (2014). DOI 10.1088/0266-5611/30/1/015001. URL http://dx.doi.org/10.1088/0266-5611/ 
30/1/015001 

27. Jacobs, D.A.H.: A generalization of the conjugate-gradient method to solve complex systems. IMA journal of numerical 
analysis 6(4), 447-452 (1986) 

28. Kabanava, M., Rauhut, H.: Analysis ^i-recovery with frames and Gaussian measurements. Acta Appl. Math, (to 
appear) 

29. King, J.T.: A minimal error conjugate gradient method for ill-posed problems. J. Optim. Theory Appl. 60, 297-304 
(1989). URL http://dx.doi.Org/10.1007/BF00940009 

30. Krahmer, F., Mendelson, S., Rauhut, H.: Suprema of chaos processes and the restricted isometry property. Comm. 
Pure Appl. Math. 67(11), 1877-1904 (2014) 

31. Lai, M.J., Xu, Y., Yin, W.: Improved iteratively reweighted least squares for unconstrained smoothed Iq minimization. 
SIAM Journal on Numerical Analysis 51(2), 927-257 (2013) 

32. Lawson, C.L.: Contributions to the Theory of Linear Least Maximum Approximation. Ph.D. thesis. University of 
California, Los Angeles (1961) 

33. Lecue, G., Mendelson, S.: Sparse recovery under weak moment assumptions. J. Europ. Math. Soc. (to appear) 

34. Nocedal, J., Wright, S.: Conjugate Gradient Methods, pp. 101—134. Springer Series in Operations Research and 
Financial Engineering. Springer (2006) 

35. Ochs, P., Dosovitskiy, A., Brox, T., Pock, T.: On iteratively reweighted algorithms for nonsmooth nonconvex optimiza¬ 
tion in computer vision. SIAM J. Imaging Sci. 8(1), 331—372 (2015). DOI 10.1137/140971518 

36. Osborne, M.R.: Finite algorithms in optimization and data analysis. Wiley Series in Probability and Mathematical 
Statistics: Applied Probability and Statistics. John Wiley Sz Sons, Ltd., Chichester (1985) 

37. Quarteroni, A., Sacco, R., Saleri, F.: Numerical Mathematics. Texts in Applied Mathematics Series. Springer-Verlag 
GmbH (2000). URL http://books.google.de/books?id=YVpyyilM7vUC 

38. Ramlau, R., Zarzer, C.A.: On the minimization of a Tikhonov functional with a non-convex sparsity constraint. 
Electron. Trans. Numer. Anal. 39, 476-507 (2012) 

39. Rauhut, H.: Compressive sensing and structured random matrices. In: M. Fornasier (ed.) Theoretical foundations and 
numerical methods for sparse recovery. Radon Series Comp. Appl. Math., vol. 9, pp. 1—92. deCruyter (2010) 

40. Rudelson, M., Vershynin, R.: On sparse reconstruction from Fourier and Gaussian measurements. Comm. Pure Appl. 
Math. 61, 1025-1045 (2008) 

41. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1-4), 259-268 
(1992) 

42. Vogel, C.R., Oman, M.E.: Fast, robust total variation-based reconstruction of noisy, blurred images. IEEE Trans. 
Image Process. 7(6), 813-824 (1998). DOI 10.1109/83.679423 

43. Voronin, S.: Regularization of linear systems with sparsity constraints with applications to large scale inverse problems. 
Ph.D. thesis. Applied and Computational Mathematics Department, Princeton University (2012) 

44. Voronin, S., Daubechies, I.: An Iteratively Reweighted Least Squares Algorithm for Sparse Regularization. 
arXiv:1511.08970 [math] (2015) 

45. Zarzer, C.A.: On Tikhonov regularization with non-convex sparsity constraints. Inverse Problems 25(2), 025,006, 13 
(2009). DOI 10.1088/0266-5611/25/2/025006. URL http://dx.doi.Org/10.1088/0266-5611/25/2/025006 


40 


